project-receptor / python-receptor


Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.

License: Other

Python 96.44% Makefile 1.43% Dockerfile 0.68% Shell 0.42% Jinja 1.04%

python-receptor's People

Contributors

adamruzicka, ansible-zuul[bot], dehort, elyezer, ghjm, ichimonji10, j00bar, jason-rh, jhjaggars, jimi-c, lphiri, matburt, psav, spredzy

python-receptor's Issues

Work Module Enhancements

work.py works on a basic level but some changes could make it much more usable and safe from out-of-control plugins. It can also provide a better framework for running plugins in.

  • Run plugins under a concurrent.futures.Executor (ThreadPoolExecutor for example); see the sketch after this list. This will allow us to track individual execution of the plugins:
    • Add a cancel contract instruction to a job invocation. This can be handled by a plugin by catching the thread stop exception.
    • A better manifest for presenting what's actually running for introspection through ping or status commands
  • Abstract the return message queue and asyncio requirements from the plugin
    • Currently each plugin needs to implement its own asyncio.Queue as that's what is expected to be returned from the execution method. Here we can give the plugin either its own Queue to push into OR a general queue accessible from all worker contexts that can interleave responses.
    • Alternatively, feeder Queues from each worker that feed into a general response queue that delivers responses back.
  • Better testing, performance, and resource usage of the work.py module
    • I'd love to be able to report overall resource usage, but on a basic level we need a mechanism for plugins to be able to inform Receptor how much work they are capable of. In the case of Ansible Runner this takes the form of how many cores and how much memory is available (in which case AWX uses that information to inform how many jobs can run).
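
A minimal sketch of what running plugins under a ThreadPoolExecutor could look like, assuming a simple job_id-to-Future mapping for tracking and cancellation; the class and method names here are illustrative, not receptor's actual work module API:

import concurrent.futures

# Illustrative sketch: track each plugin execution so it can be inspected via
# ping/status and cancelled via the proposed "cancel" contract instruction.
class PluginRunner:
    def __init__(self, max_workers: int = 4):
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.running = {}  # job_id -> Future, for introspection

    def submit(self, job_id, plugin_callable, payload):
        future = self.pool.submit(plugin_callable, payload)
        self.running[job_id] = future
        # Drop finished jobs from the manifest automatically.
        future.add_done_callback(lambda _f: self.running.pop(job_id, None))
        return future

    def cancel(self, job_id) -> bool:
        # Only jobs still queued can be cancelled by the executor itself.
        future = self.running.get(job_id)
        return future.cancel() if future else False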

raw_payload is not being passed properly to plugins

On the Controller.send method the envelope.Inner is created and the raw_payload attribute is populated from message.fd.read(). That does not work because the file pointer is not at the beginning of the file, so read() always returns an empty string. To avoid that, the file pointer should be reset to 0 with seek(0) first.
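
The behaviour is easy to demonstrate with a plain io.BytesIO buffer: after a write the position sits at the end of the buffer, so read() returns nothing until the buffer is rewound.

import io

# Minimal reproduction of the underlying behaviour.
fd = io.BytesIO()
fd.write(b"payload")
print(fd.read())   # b'' -- position is at the end of the buffer
fd.seek(0)
print(fd.read())   # b'payload'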

Below is a breakpoint added to the first line of Controller.send, inspecting a message whose content is 'payload':

> /home/elyezer/code/receptor/receptor/receptor/controller.py(45)send()
-> new_id = uuid.uuid4()
(Pdb) message
<receptor.messages.envelope.Message object at 0x7fa727b2adc8>
(Pdb) message.fd.tell()
7
(Pdb) message.fd.read()
b''
(Pdb) message.fd.seek(0)
0
(Pdb) message.fd.read()
b'payload'

Without seek(0) the other node's plugin will always get an empty string as the raw_payload:

> /home/elyezer/code/receptor/receptor-debug/receptor_debug/debug.py(21)all()
-> return ReceptorDebug(message.receptor)
(Pdb) message.raw_payload
''

The above pdb session was run on the following plugin code:

class ReceptorDebug:
    def __init__(self, receptor):
        self.done = False
        self.receptor = receptor

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.done:
            raise StopAsyncIteration
        self.done = True
        return debug_info(self.receptor)

def all(message, *args):
    __import__('pdb').set_trace()
    return ReceptorDebug(message.receptor)

The send command was run like the following:

$ receptor -d /tmp/send \
    send \
    --peer=receptor://127.0.0.1:9999 \
    --directive=debug:all \
    controller \
    'payload'

And the controller was run like the following:

$ receptor --debug \
    --node-id=controller \
    -d /tmp/controller \
    controller \
    --listen=receptor://127.0.0.1:9999

Receptor continues running if unable to bind to listen socket address

If I start two receptor processes and tell them to bind on the same socket address (by default: 8888), then one will bind, and the other will emit an error about being unable to bind but continue running.

$ poetry run receptor --data-dir="$(mktemp --directory)" node
$ poetry run receptor --data-dir="$(mktemp --directory)" node
ERROR 2020-03-05 13:57:20,800  controller [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use
Traceback (most recent call last):
  File "/home/ichimonji10/code/receptor/receptor/controller.py", line 46, in exit_on_exceptions_in
    await task
  File "/usr/lib/python3.8/asyncio/streams.py", line 94, in start_server
    return await loop.create_server(factory, host, port, **kwds)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1459, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 8888): address already in use

It's true that some network connectivity issues are unavoidable. For example, the peer specified by the --peer argument might be unavailable. But for me, there's a big difference between "the process can't bind to a socket address" and "the process can't contact a socket address." The former should cause a catastrophic failure for a server, the latter not.
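
A minimal sketch of failing fast on bind errors while leaving peer-connection errors non-fatal; the function name and exit handling are illustrative, not receptor's actual code:

import asyncio
import sys

async def start_listener(host: str, port: int) -> asyncio.AbstractServer:
    try:
        # Binding failures surface here as OSError (e.g. address already in use).
        return await asyncio.start_server(lambda r, w: None, host, port)
    except OSError as exc:
        print(f"FATAL: could not bind {host}:{port}: {exc}", file=sys.stderr)
        sys.exit(1)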

Write docs for end-user usage

I don't think the docs should try to reproduce the CLI help, but we should have a section that gives a high-level description of the available commands and what they do - what is node, send, ping, status, etc.

Wrong error message when incorrect --listen specified

When an incorrect value of --listen is specified, the following error is shown:

RuntimeError: Invalid Receptor peer specified: <listen>

It prints the actual value of --listen but refers to it as a peer. Example:

# receptor --config ./config.conf --debug --node-id node -d /tmp/node node --peer=localhost:8881 --listen=tcp://0.0.0.0:8882
INFO 2020-03-19 06:41:51,927 node entrypoints Running as Receptor node with ID: node
ERROR 2020-03-19 06:41:51,927 node __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/receptor/__main__.py", line 59, in main
    config.go()
  File "/usr/lib/python3.6/site-packages/receptor/config.py", line 516, in go
    self._parsed_args.func(self)
  File "/usr/lib/python3.6/site-packages/receptor/entrypoints.py", line 49, in run_as_node
    listen_tasks = controller.enable_server(config.node_listen)
  File "/usr/lib/python3.6/site-packages/receptor/controller.py", line 54, in enable_server
    listener = self.connection_manager.get_listener(url)
  File "/usr/lib/python3.6/site-packages/receptor/connection/manager.py", line 35, in get_listener
    service = parse_peer(listen_url, 'server')
  File "/usr/lib/python3.6/site-packages/receptor/connection/manager.py", line 23, in parse_peer
    raise RuntimeError(f"Invalid Receptor peer specified: {peer}")
RuntimeError: Invalid Receptor peer specified: tcp://0.0.0.0:8882

I understand that the --listen value can be considered a peer too, but the error message makes me look for the error in a different place than where it actually is (the tcp:// protocol, in this case).

Work requests are dispatched serially

Even though the WorkManager has a thread pool, it is not dispatching work requests in parallel. The WorkManager.handle() method does not hand control back to the event loop while checking whether the worker thread has put something onto the response_queue. This creates a spin loop in WorkManager.handle(), which drives CPU usage to 100%.
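
A minimal sketch of a handle() that does not spin, assuming the blocking worker call can be wrapped with run_in_executor so the coroutine yields to the event loop while the thread runs (the names are illustrative, not receptor's API):

import asyncio
import concurrent.futures

async def handle(work, response_queue: asyncio.Queue,
                 pool: concurrent.futures.ThreadPoolExecutor) -> None:
    loop = asyncio.get_event_loop()
    # The event loop stays free to run other tasks while the thread executes.
    result = await loop.run_in_executor(pool, work)
    await response_queue.put(result)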

Signal/Shutdown Handler

Handle SIGTERM and internal failures in such a way that existing work is completed and new work is rejected/not fetched before full shutdown.
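
A minimal sketch of such a handler, assuming a set of in-flight task objects is available; on SIGTERM it drains existing work and then stops the loop (illustrative only, not receptor's actual shutdown path):

import asyncio
import signal

def install_sigterm_handler(loop: asyncio.AbstractEventLoop, in_flight: set) -> None:
    async def drain_and_stop():
        # Let existing work finish; new work should be rejected elsewhere.
        if in_flight:
            await asyncio.gather(*in_flight, return_exceptions=True)
        loop.stop()

    loop.add_signal_handler(signal.SIGTERM,
                            lambda: asyncio.ensure_future(drain_and_stop()))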

Calling methods by a name provided from the network is inherently insecure

When a message arrives, we take the provided 'action' parameter and use it as the name passed to getattr to find the same-named method on the worker object: https://github.com/project-receptor/receptor/blob/22d01dd0a0e988711aa4fd44090c83c31d9b78aa/receptor/work.py#L53

This means other nodes on the mesh can call any method in this class. I'm not sure what exactly you can do with this but I bet you can do something.

We should have some kind of declarative scheme for registering actions, not just assume every method is a valid action.
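
A minimal sketch of a declarative registration scheme, where only explicitly decorated methods may be invoked by a name received from the network; the registry and decorator names are illustrative:

# Only methods explicitly marked with @action can be dispatched by name.
ACTIONS = {}

def action(func):
    """Mark a method as a network-callable action."""
    ACTIONS[func.__name__] = func
    return func

class WorkManager:
    @action
    def run(self, payload):
        return f"ran {payload!r}"

def dispatch(manager, name, payload):
    try:
        handler = ACTIONS[name]
    except KeyError:
        raise ValueError(f"unknown action: {name!r}")
    return handler(manager, payload)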

Use a different library for websockets connections

The aiohttp library does not support https proxy servers, which is something we would like to have. A new library, python-httpx (https://www.python-httpx.org/), appears to support websockets over https proxies. We should investigate whether it is feasible to switch to this library and, if so, whether its ws/wss-over-https implementation will work for us.

Add event_loop fixture

Two pytest tests are failing right now because the event_loop fixture doesn't exist. I assume there is some design intent around this, so I didn't want to make assumptions; I'm just filing this for tracking. It would be great to get all tests passing.
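
A minimal sketch of the missing fixture; pytest-asyncio normally provides an equivalent event_loop fixture, so this is only needed if that plugin is not in use:

import asyncio
import pytest

@pytest.fixture
def event_loop():
    # Provide a fresh event loop per test and close it afterwards.
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()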

Receptor send is not working and taking longer to send/receive (later with a patch applied)

I tried to run receptor send and it is not working. It has trouble with bytes vs. strings, and the following patch fixes it:

diff --git a/receptor/controller.py b/receptor/controller.py
index 3a9547d..7bab4bf 100644
--- a/receptor/controller.py
+++ b/receptor/controller.py
@@ -50,7 +50,7 @@ class Controller:
             message_type="directive",
             directive=message.directive,
             timestamp=datetime.datetime.utcnow().isoformat(),
-            raw_payload=message.fd.read(),
+            raw_payload=message.fd.read().decode('utf-8'),
         )
         await self.receptor.router.send(inner_env, expected_response=expect_response)
         return new_id
diff --git a/receptor/messages/envelope.py b/receptor/messages/envelope.py
index a5f817d..6c1b34d 100644
--- a/receptor/messages/envelope.py
+++ b/receptor/messages/envelope.py
@@ -37,6 +37,8 @@ class Message:
         self.fd = buffered_io
 
     def data(self, raw_data):
+        if isinstance(raw_data, str):
+            raw_data = raw_data.encode('utf-8')
         self.fd.write(raw_data)
 
 

Also, it does not return as it used to. When running the command with the above patch it receives the message but keeps running, requiring Ctrl-C to exit. Is this expected?

The last thing I observed is that send is now slower, taking about 6 seconds to get a response; it used to be snappier. To reproduce, set up the following topology and try to send a directive to any of the nodes.

# Run the controller
receptor \
    --node-id=controller \
    -d /tmp/controller \
    controller \
    --listen=receptor://127.0.0.1:9999

# Run some nodes
receptor \
    --node-id=node-a \
    -d /tmp/node-a \
    node \
    --listen=receptor://127.0.0.1:9998 \
    --peer=receptor://localhost:9999

receptor \
    --node-id=node-b \
    -d /tmp/node-b \
    node \
    --listen=receptor://127.0.0.1:9997 \
    --peer=receptor://localhost:9998

receptor \
    --node-id=node-c \
    -d /tmp/node-c \
    node \
    --listen=receptor://127.0.0.1:9996 \
    --peer=receptor://localhost:9997

# Try to send a directive to any node
receptor \
    -d /tmp/send \
    send \
    --peer=receptor://127.0.0.1:9999 \
    --directive=debug:all \
    node-c \
    'payload'

Here is what I observed trying the closest and farthest nodes. I pressed Ctrl-C as soon as the message was printed on the terminal.

$ time receptor -d /tmp/send send --peer=receptor://127.0.0.1:9999 --directive=debug:all node-c '' 
{'config': {'default_node_id': 'node-c', 'default_config': '/etc/receptor/receptor.conf', 'default_data_dir': '/tmp/node-c', 'default_debug': None, 'auth_ssl_cert': '', 'auth_ssl_key': '', 'node_listen': ['receptor://127.0.0.1:9996'], 'node_peers': ['receptor://localhost:9997'], 'node_server_disable': False, 'node_stats_enable': None, 'node_stats_port': 8889, 'node_keepalive_interval': -1, 'node_groups': [], 'controller_socket_path': '/var/run/receptor_controller.sock', 'controller_listen': ['receptor://0.0.0.0:8888'], 'controller_id': '', 'ping_peer': '', 'controller_stats_enable': None, 'controller_stats_port': 8889, 'ping_count': 0, 'ping_delay': 0.0, 'ping_recipient': '', 'send_peer': '', 'send_directive': '', 'send_recipient': '', 'send_payload': '', 'components_security_manager': '<receptor.security.MallCop object at 0x7f5e21f7b198>', 'components_buffer_manager': '<receptor.buffers.file.FileBufferManager object at 0x7f5e21f7b3c8>', 'plugins': {}}}
^C
receptor -d /tmp/send send --peer=receptor://127.0.0.1:9999  node-c ''  0.47s user 0.05s system 8% cpu 6.238 total


$ time receptor -d /tmp/send send --peer=receptor://127.0.0.1:9999 --directive=debug:all node-a ''
{'config': {'default_node_id': 'node-a', 'default_config': '/etc/receptor/receptor.conf', 'default_data_dir': '/tmp/node-a', 'default_debug': None, 'auth_ssl_cert': '', 'auth_ssl_key': '', 'node_listen': ['receptor://127.0.0.1:9998'], 'node_peers': ['receptor://localhost:9999'], 'node_server_disable': False, 'node_stats_enable': None, 'node_stats_port': 8889, 'node_keepalive_interval': -1, 'node_groups': [], 'controller_socket_path': '/var/run/receptor_controller.sock', 'controller_listen': ['receptor://0.0.0.0:8888'], 'controller_id': '', 'ping_peer': '', 'controller_stats_enable': None, 'controller_stats_port': 8889, 'ping_count': 0, 'ping_delay': 0.0, 'ping_recipient': '', 'send_peer': '', 'send_directive': '', 'send_recipient': '', 'send_payload': '', 'components_security_manager': '<receptor.security.MallCop object at 0x7f1273849e10>', 'components_buffer_manager': '<receptor.buffers.file.FileBufferManager object at 0x7f1273856080>', 'plugins': {}}}
^C
receptor -d /tmp/send send --peer=receptor://127.0.0.1:9999  node-a ''  0.44s user 0.05s system 8% cpu 6.043 total

Receptor doesn't fail gracefully when the data directory is not writable

When running receptor ping without specifying a writable data directory, it fails with the following message:

$ receptor ping foo
ERROR 2020-03-30 16:29:32,504  __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/entrypoints.py", line 102, in run_as_ping
    controller = Controller(config)
  File "/home/elyezer/code/receptor/receptor/receptor/controller.py", line 33, in __init__
    self.receptor = Receptor(config)
  File "/home/elyezer/code/receptor/receptor/receptor/receptor.py", line 110, in __init__
    os.makedirs(os.path.join(self.config.default_data_dir, self.node_id))
  File "/usr/lib64/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib64/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/lib/receptor'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/__main__.py", line 63, in main
    config.go()
  File "/home/elyezer/code/receptor/receptor/receptor/config.py", line 538, in go
    self._parsed_args.func(self)
  File "/home/elyezer/code/receptor/receptor/receptor/entrypoints.py", line 105, in run_as_ping
    controller.cleanup_tmpdir()
UnboundLocalError: local variable 'controller' referenced before assignment

It would be better to avoid showing the UnboundLocalError and state what the issue actually was.
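
A minimal sketch of surfacing a clear error instead of the UnboundLocalError; the function name is illustrative:

import os
import sys

def ensure_data_dir(path: str) -> None:
    try:
        os.makedirs(path, exist_ok=True)
    except PermissionError:
        # Exit with a human-readable message rather than a traceback.
        sys.exit(f"Data directory {path!r} is not writable; "
                 f"pass a writable location with -d/--data-dir.")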

This also opens a conversation about whether ephemeral-node commands such as ping should try to create a temporary data directory. If that is a good idea we can open a separate issue and track that enhancement separately from this one.

Transient commands should clean up after themselves

Some of the CLI commands are transient (ping, send, status, etc.). These commands create and persist a directory and manifest record in the receptor run directory when they don't need to. We could use a temporary directory, or force-clean the directories that receptor normally creates, so that they don't pollute the system.
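
A minimal sketch of the temporary-directory approach, where the command callable stands in for the ping/send/status entrypoints (hypothetical names):

import tempfile

def run_transient(command, *args):
    # `command` stands in for a transient entrypoint (hypothetical).
    # The data directory is removed automatically when the command exits.
    with tempfile.TemporaryDirectory(prefix="receptor-") as data_dir:
        return command(data_dir, *args)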

Receptor does not handle exceptions properly when starting the server

Receptor ignores any exception raised when starting the server and continues as if everything were fine. It should check that the server actually started before trying to do anything else.

To reproduce this, assume you have a controller running on port 9999, started with the following command:

$ receptor --debug \
    --node-id=controller \
    -d /tmp/controller \
    controller \
    --listen=receptor://127.0.0.1:9999 

Then try to start a node and also make it listen on port 9999:

$ receptor --debug \
    --node-id=node-a \
    -d /tmp/node-a \
    node \
    --listen=receptor://127.0.0.1:9999 \
    --peer=receptor://localhost:9999

INFO 2020-02-07 14:48:08,722 node-a entrypoints Running as Receptor node with ID: node-a
INFO 2020-02-07 14:48:08,722 node-a controller Serving on receptor://127.0.0.1:9999
INFO 2020-02-07 14:48:08,723 node-a controller Connecting to peer receptor://localhost:9999
Task exception was never retrieved
future: <Task finished coro=<start_server() done, defined at /usr/lib64/python3.6/asyncio/streams.py:86> exception=OSError(98, "error while attempting to bind on address ('127.0.0.1', 9999): address already in use")>
Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1073, in create_server
    sock.bind(sa)
OSError: [Errno 98] Address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncio/streams.py", line 119, in start_server
    return (yield from loop.create_server(factory, host, port, **kwds))
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 1077, in create_server
    % (sa, err.strerror.lower()))
OSError: [Errno 98] error while attempting to bind on address ('127.0.0.1', 9999): address already in use
DEBUG 2020-02-07 14:48:08,727 node-a base sending HI
DEBUG 2020-02-07 14:48:08,728 node-a base waiting for HI
DEBUG 2020-02-07 14:48:08,730 node-a base sending routes
DEBUG 2020-02-07 14:48:08,730 node-a receptor Emitting Route Advertisements, excluding set()
DEBUG 2020-02-07 14:48:08,731 node-a base starting normal loop
DEBUG 2020-02-07 14:48:08,732 node-a receptor spawning message_handler
DEBUG 2020-02-07 14:48:08,733 node-a receptor message_handler: FramedMessage(msg_id=107637885988034463241505519927166033911, header={'cmd': 'ROUTE', 'id': 'controller', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 1]], 'seen': ['controller', 'node-a']}, payload=None)
DEBUG 2020-02-07 14:48:08,733 node-a receptor Emitting Route Advertisements, excluding {'node-a', 'controller'}

So it tried to bind to an address that was already in use, never checked that binding failed, and continued as if everything were fine. When this happens, the following is shown in the controller logs:

DEBUG 2020-02-07 14:48:08,727 controller base waiting for HI
DEBUG 2020-02-07 14:48:08,728 controller base sending HI
DEBUG 2020-02-07 14:48:08,729 controller base sending routes
DEBUG 2020-02-07 14:48:08,729 controller receptor Emitting Route Advertisements, excluding set()
DEBUG 2020-02-07 14:48:08,731 controller base starting normal loop
DEBUG 2020-02-07 14:48:08,731 controller receptor spawning message_handler
DEBUG 2020-02-07 14:48:08,735 controller receptor message_handler: FramedMessage(msg_id=221019859519773369567608306767646276616, header={'cmd': 'ROUTE', 'id': 'node-a', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 100]], 'seen': ['controller', 'node-a']}, payload=None)
DEBUG 2020-02-07 14:48:08,736 controller receptor Emitting Route Advertisements, excluding {'controller', 'node-a'}
DEBUG 2020-02-07 14:48:08,738 controller receptor message_handler: FramedMessage(msg_id=111384449847426542261153884145803424085, header={'cmd': 'ROUTE', 'id': 'node-a', 'capabilities': [['receptor_sleep', '1.0.0']], 'groups': [], 'edges': [['controller', 'node-a', 1]], 'seen': ['node-a', 'controller']}, payload=None)
DEBUG 2020-02-07 14:48:08,738 controller receptor Emitting Route Advertisements, excluding {'controller', 'node-a'}

So node-a connected to the controller and was considered a node of the network, since the outbound connection succeeded, but it could not serve connections because its listen address was already in use.

In the scenario above you can actually send work to node-a, but you can't have nodes connect to it, since it failed to start the listener server. In other words, it handled the --peer option fine but failed to properly handle --listen.

With that said, should receptor fail if any of the expected behaviors fails to start? Maybe failing only when --listen fails is the way to go, since --peer could point to a peer that is temporarily unavailable and will come back at some point. I'm not sure, but letting receptor run despite initialization issues may cause headaches in the future, especially when trying to debug things.

[HIGH] - When a node disconnects - bad things occur on controller

The pings had finished; it looks like something happens to the Transport to make it None:

DEBUG 2019-12-10 08:37:24,905 controller router Forwarding frame 296341455170759310604991198164716191918 to ping_node
DEBUG 2019-12-10 08:37:25,234 controller base waiting for HI
DEBUG 2019-12-10 08:37:25,235 controller base sending HI
DEBUG 2019-12-10 08:37:25,235 controller base sending routes
DEBUG 2019-12-10 08:37:25,235 controller receptor Emitting Route Advertisements, excluding set()
ERROR 2019-12-10 08:37:25,247 controller base watch_queue: error received trying to write
Traceback (most recent call last):
  File "/home/psavage/workspace/receptor/receptor/connection/base.py", line 39, in watch_queue
    await conn.send(msg)
  File "/home/psavage/workspace/receptor/receptor/connection/sock.py", line 31, in send
    await self.writer.drain()
  File "/usr/lib64/python3.7/asyncio/streams.py", line 348, in drain
    await self._protocol._drain_helper()
  File "/usr/lib64/python3.7/asyncio/streams.py", line 202, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost
INFO 2019-12-10 08:37:25,249 controller receptor Removing connection <receptor.connection.sock.RawSocket object at 0x7fa324363d50> for node ping_node
Task exception was never retrieved
future: <Task finished coro=<serve() done, defined at /home/psavage/workspace/receptor/receptor/connection/sock.py:51> exception=TypeError("object NoneType can't be used in 'await' expression")>
Traceback (most recent call last):
  File "/home/psavage/workspace/receptor/receptor/connection/base.py", line 39, in watch_queue
    await conn.send(msg)
  File "/home/psavage/workspace/receptor/receptor/connection/sock.py", line 31, in send
    await self.writer.drain()
  File "/usr/lib64/python3.7/asyncio/streams.py", line 348, in drain
    await self._protocol._drain_helper()
  File "/usr/lib64/python3.7/asyncio/streams.py", line 202, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/psavage/workspace/receptor/receptor/connection/sock.py", line 53, in serve
    await factory().server(t)
  File "/home/psavage/workspace/receptor/receptor/connection/base.py", line 123, in server
    await self.start_processing()
  File "/home/psavage/workspace/receptor/receptor/connection/base.py", line 98, in start_processing
    return await self.write_task
  File "/home/psavage/workspace/receptor/receptor/connection/base.py", line 43, in watch_queue
    return await conn.close()
TypeError: object NoneType can't be used in 'await' expression
DEBUG 2019-12-10 08:37:25,251 controller receptor Emitting Route Advertisements, excluding set()
ERROR 2019-12-10 08:37:25,251 controller receptor message_handler
Traceback (most recent call last):
  File "/home/psavage/workspace/receptor/receptor/receptor.py", line 93, in message_handler
    data = await buf.get()
  File "/home/psavage/workspace/receptor/receptor/messages/envelope.py", line 157, in get
    return await self.q.get()
  File "/usr/lib64/python3.7/asyncio/queues.py", line 159, in get
    await getter
concurrent.futures._base.CancelledError
DEBUG 2019-12-10 08:37:25,258 controller base starting normal loop

Implement file-backed buffers

Currently data is persisted in memory throughout the entire Receptor pipeline.

This isn't going to work for large payloads that need to be delivered across the mesh. It would be better to persist messages to disk and relay them in chunks over the network when they are on the larger side. Smaller messages could still be delivered as data to the plugin. This could be determined by some metadata about the plugin itself.

Right now we do persist data to disk, but it is serialized as an individual JSON payload, which requires reading the entire payload into memory to (de)serialize it.

This also changes the basic message format of what's sent across the wire.
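
A minimal sketch of chunked, file-backed relaying; the chunk size and function name are illustrative:

import asyncio

CHUNK_SIZE = 64 * 1024  # illustrative chunk size

async def relay_file(path: str, writer: asyncio.StreamWriter) -> None:
    # Stream the payload from disk instead of loading it all into memory.
    with open(path, "rb") as fd:
        while True:
            chunk = fd.read(CHUNK_SIZE)
            if not chunk:
                break
            writer.write(chunk)
            await writer.drain()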

Better CLI usage information output

Running receptor with no subcommand raises the following:

$ receptor
ERROR 2020-03-30 16:33:45,492  __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/__main__.py", line 63, in main
    config.go()
  File "/home/elyezer/code/receptor/receptor/receptor/config.py", line 535, in go
    "you must specify a subcommand (%s)." % (", ".join(SUBCOMMAND_EXTRAS.keys()),)
receptor.exceptions.ReceptorRuntimeError: you must specify a subcommand (node, ping, send, status).

It would be better if receptor, like other CLI tools, printed the usage information when the minimum expected options/subcommands are not provided.

RFE: Receptor node "relay security level" feature

This is a general request for enhancement within the receptor mesh to address the fundamental expectation of directionality of traffic in a traditional multi-layer network design.

Apologies in advance for the length of this message.

Situation being considered:

  • Every node in the receptor mesh can perform any role, these being controller, relay and worker.
  • Once a node has joined the mesh, it has full rights.
  • If you compromise a node, you compromise the mesh.

In a traditional on-prem/data center style network there's frequently the simplification that traffic heading outwards is fine, traffic heading inwards is not. This is well explained at this reference, using Cisco ASA interface security levels as the example: https://geek-university.com/ccna-security/asa-security-levels-explained/

Note: one difference here is that interfaces in a Cisco ASA firewall are by necessity the boundaries BETWEEN security levels, and as that URL notes traffic does not flow by default between interfaces with the same security level. In our receptor mesh example this does not hold true, and nodes are contained INSIDE given security levels and nodes with the same security level must be able to relay traffic between themselves.

For this example use case with the expectation that traffic should freely flow outwards through network zones but not inwards, everything is thrown out of the window once receptor mesh is installed.

Let's rework that in the form of a problem to solve:

Problem statement:

  • In a traditional DMZ network it is usually assumed that no traffic is free to flow back inside the core.
  • Once receptor mesh is in place, this no longer holds true. We lose defence in depth, and we depend entirely on the security of EVERY receptor node to maintain the security of the system.
  • Worse, as the size of the mesh increases, the security of the system gets weakened.

The concept behind the proposed enhancement is to grant every node a "relay security level" as an immutable element of its registration into the mesh. For the sake of this example I propose simply an integer in the range 0..255.

Because any node can perform any or all of the three roles, we take a very simple approach to security levels and state this:
Any node will only receive a WORK ORDER message from another node if the sender has an equal, or higher, security level than its own.

In our simplest case we'll see that if all nodes have the same security level, there is no change in behaviour. All nodes are fully trusted.

In a simple demonstration (the network diagram from the Cisco security level URL), we can assign level 100 to all nodes in a core network, and level 50 to all nodes in the DMZ. All nodes within the core network are mutually trusted. All nodes within the DMZ are mutually trusted. Any node in the DMZ can receive a work order from the core. No node in the core will receive a work order from the DMZ. Directionality is created. (See the notes at the end for the nuance that this must only apply to work orders; result messages MUST be permitted to flow in reverse.)
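
A minimal sketch of the proposed acceptance rule (result messages would bypass this check):

def accept_work_order(sender_level: int, receiver_level: int) -> bool:
    # A work order is accepted or relayed only when the sender's relay
    # security level is equal to or higher than the receiver's.
    return sender_level >= receiver_level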

Two additional, sample network scenarios are worth mentioning.

First, a network with two concentric circles of DMZ. Apply levels 100, 50, 25 and the directionality gets extended in the same sequence as one would expect.
Second, a network with two independent DMZs. Apply levels 100, 50, 50. So long as there isn't a direct connection between the two DMZs, the only way to route a message between them is through the core, which is now impossible, since it would require sending a message THROUGH a higher security level even though the final recipient is at the same level.

The result is better containment. Even if a DMZ is fully breached, and the receptor node is breached, the compromise is limited to the hosts within the perimeter of that security level. No other zones, not the core, not other elements. Separately, interesting elements come into play, such as whether I can now DoS the core by sending spurious result messages, but spoofed work orders cannot be sent.

Additional thoughts:

As noted above, one important difference between the receptor mesh and firewalls is that Cisco ASA interfaces are by necessity the boundaries BETWEEN security levels, and traffic does not flow by default between interfaces with the same security level; receptor nodes, by contrast, sit INSIDE a given security level, and nodes with the same security level must be able to relay traffic between themselves.

A special case of "security level 0" will add value. This is to say, "this node is not allowed to relay work requests to anyone", or effectively to flag the node as entirely untrusted. This might apply for a hub and spoke model where only one node is present within an untrusted network. As soon as two nodes are present in that untrusted network, they will need a non-zero (but presumed small) integer so as to allow them to relay work between themselves.

This control over relaying messages must only apply to WORK ORDERS. Any results that are to be returned to the sender must be accepted. It would be prudent to be careful in the processing of result messages as they may have originated in a hostile network zone. At the least, avoidance of any type of buffer overrun or data mishandling; confirmation that the result correlates with a valid and recent work order; etc.

A different use-case of a hub and spoke network with a single bastion node and multiple independent cloud virtual networks would also benefit from the same design. A single, centralized, and secured bastion node(s) can send messages out along any spoke, but there's no call for spokes to send messages between themselves.

Where/how to embed this security level is left as an exercise to the reader. Perhaps embedded as a data element in the mTLS certs that it presents to the network would be one place to consider.

In consideration of the DMZ model we should also consider directionality of registration to the mesh. In a worst case situation we cannot permit a DMZ node to be breached and merely re-register itself with the mesh at a higher security level. This, too, is left as an exercise to the reader.

Receptor seems to be sending response to wrong nodes

Considering a 3-node mesh:

$ poetry run receptor --debug --node-id=controller -d /tmp/controller controller --listen=receptor://127.0.0.1:9999
$ poetry run receptor --debug --node-id=node-a -d /tmp/node-a node --listen=receptor://127.0.0.1:9998 --peer=receptor://localhost:9999
$ poetry run receptor --debug --node-id=node-b -d /tmp/node-b node --listen=receptor://127.0.0.1:9997 --peer=receptor://localhost:9998

Run two ping processes in parallel connecting to the controller node, where one targets node-a and the other targets node-b. Observe in the output that WARNING messages are raised saying that a response was received but there is no record of the sent message. Also, the ping processes keep running, which means the finished message was never received:

$ poetry run receptor -d /tmp/ping-b ping --peer=receptor://127.0.0.1:9999 --delay 0 --count 10 node-b  
{"initial_time": "2020-03-06T18:27:28.120859", "response_time": "2020-03-06 18:27:28.156398", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.125913", "response_time": "2020-03-06 18:27:28.172401", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.133262", "response_time": "2020-03-06 18:27:28.188726", "active_work": []}
WARNING 2020-03-06 13:27:28,207  receptor Received response to 9762e4e9-3fba-414c-9e4f-496ca0382634 but no record of sent message.
{"initial_time": "2020-03-06T18:27:28.140208", "response_time": "2020-03-06 18:27:28.205756", "active_work": []}
WARNING 2020-03-06 13:27:28,232  receptor Received response to 0e5de5d0-9c3e-40e0-b0db-3672c1b3399b but no record of sent message.
WARNING 2020-03-06 13:27:28,244  receptor Received response to 327ec7dc-d779-451e-988d-ba01e99bb36d but no record of sent message.
WARNING 2020-03-06 13:27:28,250  receptor Received response to c7547678-997e-4c38-8b29-d243d9e0ca4b but no record of sent message.
{"initial_time": "2020-03-06T18:27:28.163389", "response_time": "2020-03-06 18:27:28.237886", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.170925", "response_time": "2020-03-06 18:27:28.246019", "active_work": []}
^C
$ poetry run receptor -d /tmp/ping-a ping --peer=receptor://127.0.0.1:9999 --delay 0 --count 10 node-a  
{"initial_time": "2020-03-06T18:27:28.124886", "response_time": "2020-03-06 18:27:28.158935", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.130734", "response_time": "2020-03-06 18:27:28.164155", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.138053", "response_time": "2020-03-06 18:27:28.179573", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.146100", "response_time": "2020-03-06 18:27:28.192977", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.158007", "response_time": "2020-03-06 18:27:28.206066", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.162428", "response_time": "2020-03-06 18:27:28.211155", "active_work": []}
WARNING 2020-03-06 13:27:28,237  receptor Received response to acccd167-2361-4b85-8d86-655bf1c70489 but no record of sent message.
WARNING 2020-03-06 13:27:28,247  receptor Received response to 2624d5ed-6815-481a-b16a-ef13844087ac but no record of sent message.
WARNING 2020-03-06 13:27:28,252  receptor Received response to 2248820e-0009-42f6-91b6-93cce83ef8df but no record of sent message.
WARNING 2020-03-06 13:27:28,256  receptor Received response to ebd639c8-1d23-4757-8270-247a1b357257 but no record of sent message.
^C

Configuration infrastructure for plugins

For the upcoming Satellite plugin we need a way to configure receptor. In this case we also need multiple receptor instances on the same host. We need to specify the following:

  • Satellite URL (probably localhost, but most likely we'll need to use FQDN due to SSL certificate)
  • username
  • password
  • path to Satellite certificate, path to CA (we generate self-signed certs, they don't have to be trusted)

We discussed two options - either we take care of it, or receptor provides some standard way to do it. I'd prefer the latter, as there are other options the plugin will need to take into consideration that come from the cloud.redhat.com side - live output (true/false), refresh interval (seconds).

Just to give the full picture of the Satellite use case: receptor itself also needs configuration; it needs to be pointed to the certificate/private key it uses when calling cloud.redhat.com.
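
A minimal sketch of the plugin-facing settings listed above; the key names and structure are illustrative, not an existing receptor configuration format:

# Illustrative only: the shape such a plugin configuration could take.
SATELLITE_PLUGIN_CONFIG = {
    "url": "https://satellite.example.com",       # FQDN so the SSL certificate matches
    "username": "admin",
    "password": "changeme",
    "cert_path": "/etc/receptor/satellite.pem",
    "ca_path": "/etc/receptor/satellite-ca.pem",  # self-signed certs need not be trusted
    "live_output": True,                          # option from the cloud.redhat.com side
    "refresh_interval": 60,                       # seconds
}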

Connection lost leaves the receptor network in a non-working state

Consider a network like controller -> node-a -> node-b -> node-c, with ping running every second using the controller as the ping peer to ping node-c. Stop either node-b or node-a and, after a while, bring it back up: the network stops working and ping does not recover. The following stack trace is also shown. To get the network working again, all the nodes need to be restarted.

ERROR 2019-12-16 13:32:40,275 node-b base watch_queue: error received trying to write
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/connection/base.py", line 39, in watch_queue
    await conn.send(msg)
  File "/home/elyezer/code/receptor/receptor/receptor/connection/sock.py", line 31, in send
    await self.writer.drain()
  File "/usr/lib64/python3.6/asyncio/streams.py", line 339, in drain
    yield from self._protocol._drain_helper()
  File "/usr/lib64/python3.6/asyncio/streams.py", line 210, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost
Task exception was never retrieved
future: <Task finished coro=<serve() done, defined at /home/elyezer/code/receptor/receptor/receptor/connection/sock.py:51> exception=TypeError("object NoneType can't be used in 'await' expression",)>
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/connection/base.py", line 39, in watch_queue
    await conn.send(msg)
  File "/home/elyezer/code/receptor/receptor/receptor/connection/sock.py", line 31, in send
    await self.writer.drain()
  File "/usr/lib64/python3.6/asyncio/streams.py", line 339, in drain
    yield from self._protocol._drain_helper()
  File "/usr/lib64/python3.6/asyncio/streams.py", line 210, in _drain_helper
    raise ConnectionResetError('Connection lost')
ConnectionResetError: Connection lost

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/connection/sock.py", line 53, in serve
    await factory().server(t)
  File "/home/elyezer/code/receptor/receptor/receptor/connection/base.py", line 128, in server
    await self.start_processing()
  File "/home/elyezer/code/receptor/receptor/receptor/connection/base.py", line 103, in start_processing
    return await self.write_task
  File "/home/elyezer/code/receptor/receptor/receptor/connection/base.py", line 43, in watch_queue
    return await conn.close()
TypeError: object NoneType can't be used in 'await' expression

Unclear response when sending bad directive

when sending this directive:

receptor send --socket-path=/tmp/receptor.sock --directive=receptor-http.execute node-a '{"method": "GET", "url": "http://localhost:3002/api/sources/v1.0/sources"}'

this response is returned:

{"message_id": "644cdb7c-384c-4f33-bfe7-e5c8d8acebe3", "sender": "node-a", "recipient": "controller", "message_type": "response", "timestamp": "2019-09-18T13:30:23.494909", "raw_payload": "not enough values to unpack (expected 2, got 1)", "directive": null, "in_response_to": "d906defc-0699-4f98-8108-a416b1523c52", "ttl": 15, "serial": 2, "code": 1}

debug log from node:

DEBUG 2019-09-18 15:32:59,823 protocol b'{"frame_id": "eb388e83-45fa-4756-824f-d31e892cefa6", "sender": "controller", "recipient": "node-a", "route_list": ["controller", "controller"], "inner": "{\\"message_id\\": \\"e8659027-8b6c-4329-8483-dc224d7c8cdd\\", \\"sender\\": \\"controller\\", \\"recipient\\": \\"node-a\\", \\"message_type\\": \\"directive\\", \\"timestamp\\": \\"2019-09-18T13:32:59.807935\\", \\"raw_payload\\": \\"{\\\\\\"method\\\\\\": \\\\\\"GET\\\\\\", \\\\\\"url\\\\\\": \\\\\\"http://localhost:3002/api/sources/v1.0/sources\\\\\\"}\\", \\"directive\\": \\"receptor-http.execute\\", \\"in_response_to\\": null, \\"ttl\\": null, \\"serial\\": 1, \\"code\\": 0}"}\x1b[K'
DEBUG 2019-09-18 15:32:59,886 router Shortest path to controller with cost 1 is ['controller', 'node-a']
DEBUG 2019-09-18 15:32:59,887 router Sending 351bdef4-c45d-4e44-a3dc-80f3ea4e355c to controller via controller
DEBUG 2019-09-18 15:32:59,887 router Forwarding frame e720601f-ce93-4dbe-add9-75f383c4b17a to controller

The --directive should be "receptor_http:execute" instead of "receptor-http.execute"
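
A minimal sketch of validating the directive format up front so the user sees the expected plugin:action form instead of an unpacking error; the function name is illustrative:

def parse_directive(directive: str):
    plugin, sep, action = directive.partition(":")
    if not sep or not plugin or not action:
        raise ValueError(
            f"Invalid directive {directive!r}: expected 'plugin:action', "
            f"e.g. 'receptor_http:execute'"
        )
    return plugin, action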

If `--listen` has invalid scheme, receptor will throw traceback

If receptor is started and an invalid scheme is given to a --listen flag, then receptor will crash with an unhelpful traceback:

$ poetry run receptor --data-dir="$(mktemp --directory)" node --listen='harglebargle://127.0.0.1:8888'
ERROR 2020-02-18 10:36:39,398  __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/home/ichimonji10/code/receptor/receptor/__main__.py", line 59, in main
    config.go()
  File "/home/ichimonji10/code/receptor/receptor/config.py", line 477, in go
    self._parsed_args.func(self)
  File "/home/ichimonji10/code/receptor/receptor/entrypoints.py", line 49, in run_as_node
    controller.enable_server(config.node_listen)
  File "/home/ichimonji10/code/receptor/receptor/controller.py", line 33, in enable_server
    self.loop.create_task(listener)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 431, in create_task
    task = tasks.Task(coro, loop=self, name=name)
TypeError: a coroutine was expected, got None

It is correct for receptor to bail, but the traceback shown is unhelpful. It would be better if receptor printed a concise message that in some way pointed to the --listen flag, rather than complaining about the lack of a coroutine.
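
A minimal sketch of validating the --listen scheme before any listener coroutine is created; the accepted scheme set here is an assumption, not the actual list receptor supports:

from urllib.parse import urlparse

ACCEPTED_SCHEMES = {"receptor", "ws", "wss"}  # illustrative assumption

def validate_listen_url(url: str) -> None:
    scheme = urlparse(url).scheme
    if scheme not in ACCEPTED_SCHEMES:
        # Point the user at the --listen flag rather than an internal TypeError.
        raise SystemExit(f"Invalid --listen URL {url!r}: unsupported scheme {scheme!r}")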

Node emits traceback when failing to connect to peer

When a node starts, it can be told to connect to a peer. It's expected that a node might be unable to connect to that peer right away. In this case, the node will simply sleep for a while, then re-try connecting.

When a node fails to connect to a peer, I expect it to either handle this silently, or with a terse message like:

Failed to connect to 127.0.0.1:8889. Will retry in 5 seconds.

In reality, a node will handle this error by emitting a traceback:

$ receptor --data-dir="$(mktemp --directory)" node --peer='127.0.0.1:8889'
ERROR 2020-02-05 14:15:13,240  sock sock.connect
Traceback (most recent call last):
  File "/home/ichimonji10/code/receptor/receptor/connection/sock.py", line 40, in connect
    r, w = await asyncio.open_connection(host, port, loop=loop, ssl=ssl)
  File "/usr/lib/python3.8/asyncio/streams.py", line 52, in open_connection
    transport, _ = await loop.create_connection(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1021, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1006, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 920, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 494, in sock_connect
    return await fut
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 526, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8889)

IMO, emitting tracebacks for expected errors that are being caught and handled is problematic behaviour:

  • Messages such as this are likely to generate alarm in end users. This might manifest in any number of negative ways:

    • Users might file spurious bugs against receptor.
    • Users might conclude that receptor is a poor quality product and abandon it, or become less satisfied with any product that makes use of it.
  • Messages such as this lower the signal to noise ratio, making it harder for QE to zero in on more important information, and possibly obscuring other tracebacks that genuinely signal problems.

Can receptor respond to this expected operating condition in a less-alarming and more terse manner?
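
A minimal sketch of a quieter retry loop that logs one short line per failed attempt instead of a traceback; the delay and function name are illustrative:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def connect_with_retry(host: str, port: int, delay: float = 5.0):
    while True:
        try:
            return await asyncio.open_connection(host, port)
        except OSError:
            # Terse, expected-error message instead of a full traceback.
            logger.warning("Failed to connect to %s:%s. Will retry in %s seconds.",
                           host, port, delay)
            await asyncio.sleep(delay)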

Nodes with same RECEPTOR_NODE_ID can exist on the same mesh

When starting a receptor node, the RECEPTOR_NODE_ID environment variable can be set and receptor will use it as the node ID. The problem with that is there is no validation that the specified node ID is not already in use by another node in the mesh.

When more than one node has the same ID the message routing does not work as expected and therefore can lead to message loss since a message may be routed to the wrong node.

This can easily be seen by doing the following. First run a 3-node mesh:

$ poetry run receptor --debug --node-id=controller -d /tmp/controller controller --listen=receptor://127.0.0.1:9999
$ poetry run receptor --debug --node-id=node-a -d /tmp/node-a node --listen=receptor://127.0.0.1:9998 --peer=receptor://localhost:9999
$ poetry run receptor --debug --node-id=node-b -d /tmp/node-b node --listen=receptor://127.0.0.1:9997 --peer=receptor://localhost:9998

The above will start a mesh where controller -> node-a -> node-b. Then run two ping commands in parallel, one pinging node-a and the other pinging node-b. Use the controller node as the peer for both ping commands and set the same RECEPTOR_NODE_ID for both:

$ export RECEPTOR_NODE_ID="15477521-bcc0-446d-abc3-e3d80d57ec6b"

$ poetry run receptor -d /tmp/ping-a ping --peer=receptor://127.0.0.1:9999 --delay 0 --count 10 node-a  
{"initial_time": "2020-03-06T18:27:28.124886", "response_time": "2020-03-06 18:27:28.158935", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.130734", "response_time": "2020-03-06 18:27:28.164155", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.138053", "response_time": "2020-03-06 18:27:28.179573", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.146100", "response_time": "2020-03-06 18:27:28.192977", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.158007", "response_time": "2020-03-06 18:27:28.206066", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.162428", "response_time": "2020-03-06 18:27:28.211155", "active_work": []}
WARNING 2020-03-06 13:27:28,237  receptor Received response to acccd167-2361-4b85-8d86-655bf1c70489 but no record of sent message.
WARNING 2020-03-06 13:27:28,247  receptor Received response to 2624d5ed-6815-481a-b16a-ef13844087ac but no record of sent message.
WARNING 2020-03-06 13:27:28,252  receptor Received response to 2248820e-0009-42f6-91b6-93cce83ef8df but no record of sent message.
WARNING 2020-03-06 13:27:28,256  receptor Received response to ebd639c8-1d23-4757-8270-247a1b357257 but no record of sent message.
^C
$ export RECEPTOR_NODE_ID="15477521-bcc0-446d-abc3-e3d80d57ec6b"

$ poetry run receptor -d /tmp/ping-b ping --peer=receptor://127.0.0.1:9999 --delay 0 --count 10 node-b  
{"initial_time": "2020-03-06T18:27:28.120859", "response_time": "2020-03-06 18:27:28.156398", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.125913", "response_time": "2020-03-06 18:27:28.172401", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.133262", "response_time": "2020-03-06 18:27:28.188726", "active_work": []}
WARNING 2020-03-06 13:27:28,207  receptor Received response to 9762e4e9-3fba-414c-9e4f-496ca0382634 but no record of sent message.
{"initial_time": "2020-03-06T18:27:28.140208", "response_time": "2020-03-06 18:27:28.205756", "active_work": []}
WARNING 2020-03-06 13:27:28,232  receptor Received response to 0e5de5d0-9c3e-40e0-b0db-3672c1b3399b but no record of sent message.
WARNING 2020-03-06 13:27:28,244  receptor Received response to 327ec7dc-d779-451e-988d-ba01e99bb36d but no record of sent message.
WARNING 2020-03-06 13:27:28,250  receptor Received response to c7547678-997e-4c38-8b29-d243d9e0ca4b but no record of sent message.
{"initial_time": "2020-03-06T18:27:28.163389", "response_time": "2020-03-06 18:27:28.237886", "active_work": []}
{"initial_time": "2020-03-06T18:27:28.170925", "response_time": "2020-03-06 18:27:28.246019", "active_work": []}
^C

Observe the WARNING messages in both ping command logs: because both processes had the same node ID, the router incorrectly delivered messages to a node that wasn't the expected one.

All the above is summarized by the following diagram:

[receptor-issue-same-node-id diagram]
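
A minimal sketch of rejecting a duplicate node ID at registration time; the connections mapping is illustrative, not receptor's actual data structure:

def register_connection(connections: dict, node_id: str, conn) -> None:
    # Refuse a peer whose node ID is already present, instead of silently
    # routing to whichever connection happened to register last.
    if node_id in connections:
        raise ValueError(f"node ID {node_id!r} is already present in the mesh")
    connections[node_id] = conn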

Receptor sometimes listens on random port

If receptor is started and no --listen flag is passed, it will listen on *:8888. However, if told to listen on receptor://127.0.0.1, then it will listen on a random port. This behaviour is inconsistent, and at odds with receptor's documented default of listening on port 8888.

--listen absent:

$ poetry run receptor --data-dir="$(mktemp --directory)" node
$ lsof -Pnp 12264 | grep LISTEN
python  12264 ichimonji10    6u     IPv4             254934      0t0      TCP *:8888 (LISTEN)

--listen present:

$ poetry run receptor --data-dir="$(mktemp --directory)" node --listen='receptor://127.0.0.1'
$ lsof -Pnp 11713 | grep LISTEN
python  11713 ichimonji10    6u     IPv4             258329      0t0      TCP 127.0.0.1:39235 (LISTEN)
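
A minimal sketch of falling back to the documented default port when --listen omits one, rather than letting the operating system pick a random port:

from urllib.parse import urlparse

DEFAULT_PORT = 8888

def listen_port(listen_url: str) -> int:
    # urlparse("receptor://127.0.0.1").port is None, so fall back to the default.
    return urlparse(listen_url).port or DEFAULT_PORT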

Controller should better check listen argument

When I run controller like this (note 0.0.0.0/0)

receptor --debug --node-id controller -d /tmp/controller controller --listen=receptor://0.0.0.0/0:8888

here's the output

[root@foreman-nuc1 system]# receptor --debug --node-id controller -d /tmp/controller controller --listen=receptor://0.0.0.0/0:8888
INFO 2020-01-15 09:43:13,528 controller entrypoints Running as Receptor controller with ID: controller
INFO 2020-01-15 09:43:13,529 controller controller Serving on receptor://0.0.0.0/0:8888

It does not report any issue, but in fact it does not listen on 8888. It should either hard-fail with a non-zero exit code or print an error to the logs.

KeyError: 'ws_extra_headers' in Receptor 0.6

Getting "KeyError: 'ws_extra_headers'" in Receptor 0.6 (shipped with latest Satellite 6.7 snap - 16). @matburt says this needs to be cherrypicked.

# receptor --debug -d /tmp/send send --peer=127.0.0.1:8881 node --directive=receptor_satellite:execute "$(cat data.json)"
INFO 2020-03-19 09:30:36,099  entrypoints Sending directive receptor_satellite:execute to node via 127.0.0.1:8881
DEBUG 2020-03-19 09:30:36,111  entrypoints Removing temporary directory /tmp/send/e6bd83ca-f85d-455d-9b0a-5e22d4b7f533
ERROR 2020-03-19 09:30:36,111  __main__ main: an error occured while running receptor
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/receptor/__main__.py", line 59, in main
    config.go()
  File "/usr/lib/python3.6/site-packages/receptor/config.py", line 516, in go
    self._parsed_args.func(self)
  File "/usr/lib/python3.6/site-packages/receptor/entrypoints.py", line 149, in run_as_send
    controller.run(send_entrypoint)
  File "/usr/lib/python3.6/site-packages/receptor/controller.py", line 89, in run
    self.loop.run_until_complete(app())
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/lib/python3.6/site-packages/receptor/entrypoints.py", line 116, in send_entrypoint
    config.ws_extra_headers, send_message, read_responses)
  File "/usr/lib/python3.6/site-packages/receptor/config.py", line 561, in __getattr__
    value = self._config_options[key]
KeyError: 'ws_extra_headers'

Need for HTTP proxy support

If a user wants to let traffic flow through an HTTP proxy, receptor should provide options to specify the proxy URL together with a username and password.

Deduplicate dependency and packaging tooling

PEP 518 and PEP 517 define how to install and execute a build system for Python packages, respectively. Of interest to developers is that they define a pyproject.toml file. For a project that depends upon setuptools, its complete contents would be as follows:

[build-system]
requires = ["setuptools", "wheel"]  # PEP 508 specifications.

When the build frontend executes, it will install and invoke setuptools, which will look for files like setup.py.

Projects aren't required to use setuptools to build packages, though. They could also choose a backend like poetry. If poetry was used, the pyproject.toml file would be significantly longer, and no setup.* files would exist.

This project appears to make use of both setuptools and poetry. This is confusing to me, and probably other humans as well. It's also a source of bugs. Consider:

  • pyproject.toml depends upon prometheus-client = "^0.7.1"
  • setup.py depends upon prometheus_client==0.7.1

Both the project name and the version specifiers differ! What's up with that? And:

  • setup.py lists two dependencies
  • pyproject.toml lists eleven dependencies

What's up with that? And there's more.

Could this project settle on a single build system? That would make life easier for humans, who only have to learn a single build system, and would provide a smaller surface area for bugs, as only one build system would need to be maintained, and two build systems wouldn't need to be kept in sync.

Exception raised when starting a receptor node with messages to expire

I've been working on getting more information about #79: I had ping running with a node missing from the network, so all the messages were being held on the controller node. Today, when starting the controller again without cleaning the data directory, it showed that the messages left over from yesterday were about to expire:

...
INFO 2019-12-20 11:09:37,812 controller file Expiring message 6f628495-04f7-4aa3-a48f-e2ca92b2dc67
INFO 2019-12-20 11:09:37,816 controller file Expiring message 21e3a76f-3170-4cf6-9146-c1a5cae54294

After the expiration log messages finished, the following exception appeared in the output:

Task exception was never retrieved
future: <Task finished coro=<Receptor.watch_expire() done, defined at /home/elyezer/code/receptor/receptor/receptor/receptor.py:48> exception=FileNotFoundError(2, 'No such file or directory')>
Traceback (most recent call last):
  File "/home/elyezer/code/receptor/receptor/receptor/receptor.py", line 53, in watch_expire
    await buffer.expire()
  File "/home/elyezer/code/receptor/receptor/receptor/buffers/file.py", line 121, in expire
    await self._loop.run_in_executor(pool, os.remove, self._path_for_ident(ident))
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/controller/messages/21e3a76f-3170-4cf6-9146-c1a5cae54294'
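
A minimal sketch of tolerating an already-removed message file during expiry; the function name is illustrative:

import contextlib
import os

def remove_expired(path: str) -> None:
    # Ignore files that have already been removed instead of letting
    # FileNotFoundError escape the expiry task.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)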

Test client reconnects

Ensure that when a server disconnects, clients attempt to reconnect automatically.

Receptor exits with zero when unable to listen

Try running the following commands twice, concurrently:

poetry run receptor --data-dir="$(mktemp --directory)" node

One process will bind to 0.0.0.0:8888, and the other will fail to bind and bail. The process that bails should, intuitively, exit with a non-zero status code. However, it exits with a zero status code. This seems wrong.

`--listen` parameter lets one specify nonsensical behaviour

I can start receptor like so:

poetry run receptor --data-dir="$(mktemp --directory)" node --listen=receptor://127.0.0.1:8888

This works fine. But let's say I start receptor like so:

poetry run receptor --data-dir="$(mktemp --directory)" node --listen=receptor://127.0.0.1:8888/foo

In this case, receptor will (to the best of my knowledge) silently discard the path /foo. This seems problematic to me. The user might expect one thing ("receptor will nest its listen interface under the /foo path"), but receptor will do something else. This could be a source of user confusion.

If receptor is going to discard the path, then why does it allow a path to be specified in the first place? Perhaps receptor should reject --listen arguments with information that'd be dropped anyway. For example, the following URL could be rejected:

>>> from urllib.parse import urlparse
>>> urlparse('receptor://127.0.0.1/0:8888')
ParseResult(scheme='receptor', netloc='127.0.0.1', path='/0:8888', params='', query='', fragment='')

The logic could be something like:

if res.path or res.params or res.query or res.fragment:
    raise InvalidListenArgumentError(...)

Routes sometimes fail to propagate

To the best of my knowledge, all receptor nodes in a mesh are supposed to exchange routing information such that they all know about every connection in the mesh. For example, if the following mesh is created:

[mesh diagram]

Then every node should know about connections like:

  • diag_node -- controller
  • controller -- nodeX
  • controller -- node1
  • node1 -- node2
  • node1 -- node7
  • node1 -- node8
  • node1 -- node11

...and so on.

In addition, receptor nodes only transmit routing information under very specific circumstances, such as "a node has joined the mesh." Otherwise, nodes stay completely silent, and don't exchange routing information. This means that if a node somehow becomes out of sync and lacks information about a connection, it will stay out of sync until one of the triggering events occurs.

Several automated tests create a mesh, and then do something with that mesh. One of the simplest tests to do this is in test/perf/test_route.py::test_add_remove_node. Other test cases also illustrate this problem, but I point to this test case because of its simplicity. It does the following:

  1. Create the mesh illustrated above, except for nodeX. Verify that all nodes have the same routing information.
  2. Spawn nodeX and make it peer with the "controller" node. Verify that all nodes have updated routing information.

The node named "controller" is a controller node, and all other nodes are normal nodes. Unfortunately, the test sometimes fails, for one of two reasons:

  1. Nodes completely fail to start, on the line of code that reads mesh.start(wait=True).
  2. Nodes do start, but some node fails to learn about the new connection between nodeX and the controller, as discovered by the line of code that reads random_mesh.settle().

For the purposes of this bug report, the latter reason is the important one. I can reproduce this issue and start poking at it with the following command:

poetry run pytest test/perf/test_route.py::test_add_remove_node --pdb

Repeated test runs have shown the following results:

  1. Routing info not propagated to node 1.
  2. Routing info not propagated to node 12.
  3. Routing info not propagated to node 12.
  4. Routing info not propagated to node 12.
  5. Routing info not propagated to node 12.
  6. Routing info not propagated to node 12.
  7. Routing info not propagated to node 12.

And in every single case, the issue has been that the mentioned node lacks information about the connection between the nodes named "controller" and "nodeX." I can prove this with commands like this (issued from within the context of a Mesh instance):

>>> node12_routes = self.nodes['node12'].get_routes()
>>> node10_routes = self.nodes['node10'].get_routes()
>>> mesh_routes = self.generate_routes()
>>> mesh_routes == node10_routes
True
>>> node10_routes - node12_routes
{controller -- nodeX}
>>> node12_routes - node10_routes
set()

This issue can be fixed by forcing routing information to be re-broadcast, with a command like:

poetry run receptor --data-dir="$(mktemp --directory)" status --peer=receptor://127.0.0.1:...

I don't have any clue as to why routing information would fail to be propagated to a node, or why node12 is the node most commonly affected. I just know that it happens very frequently, and that it strongly indicates something wrong with the route propagation protocol. I'm running tests on localhost, which means that the network layer is about 100% reliable. This problem is likely to be way worse in the wild, where the network layer is much less reliable.

Perhaps this issue could be solved by taking inspiration from existing route propagation protocols. RIP is rudimentary, but would probably be better than the current state of affairs.
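
A minimal sketch of RIP-style periodic re-advertisement, so an out-of-sync node eventually converges even without a membership change; the emit callback and interval are illustrative:

import asyncio

async def periodic_route_advertisements(emit_routes, interval: float = 30.0) -> None:
    # Re-broadcast routing information on a timer, not only on membership changes.
    while True:
        await asyncio.sleep(interval)
        emit_routes()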

Receptor controller can't route to itself

Description

Receptor controller can't ping itself. As a side effect, a lot of "Failed to ping node: No route found to controller" messages are printed on the console.

Steps to reproduce

Run a controller:

$ receptor --node-id=controller -d /tmp/controller controller --socket-path=/tmp/receptor.sock --listen-port=9999

Try to ping the controller:

$ receptor ping --socket-path=/tmp/receptor.sock controller
Failed to ping node: No route found to controller
...

Expected results

The receptor controller should be able to ping itself and return output like the one below:

$ receptor ping --socket-path=/tmp/receptor.sock node-a    
{"message_id": "84803370-3b30-4349-8094-aeaba4bb5f8d", "sender": "node-a", "recipient": "controller", "message_type": "response", "timestamp": "2019-10-25T14:24:53.491198", "raw_payload": "2019-10-25T14:24:53.321769|2019-10-25T14:24:53.491104", "directive": null, "in_response_to": "2581e61a-c8c8-45e9-9320-18e8c0e077ab", "ttl": null, "serial": 1, "code": 0}
