nytlabs / st-core
a data flow graphical programming language for data science
License: Apache License 2.0
there needs to be an example that shows how to control flow with gate and latch
assets used for display should either be embedded in the binary or placed in a path somewhere. st-core should not rely on relative paths for assets.
an example that demonstrates blocking, what it is, and how to work with it. this should probably include several different circumstances for examples.
how pressure works in a chain of connected blocks.
There should be the option, even if it's not the default, to hide a group's internal connections. These are the ones that have been given a value, or are connected to other blocks in the same group. The reason is that the group, at the moment, is only doing half a job of simplifying the pattern. For example: the `==publish`, the `peek and shift`, and the `kvIncrement` groups aren't particularly complex, but look like these big fat ICs lying around in the pattern. While it's certainly better than them all being exposed, the `==publish` filter could just have two inputs and one out, the `peek and shift` could be one out and two value connections, and the `kvIncrement` could be two ins and two value connections.
an example that shows the ingest of some data, splits the data up and operates on it through different means, and then joins it at the end.
An inbound route should:
A path allows the user to specify which bit of a message to use in the block. Paths use go-fetch by @nikhan, which is a specific subset of jq/gojee that is only for access and not for manipulation at all. A path will default to `.`, which indicates the whole message.
Each route has a path which can be set by the UI. A path is always a string like `.foo[3].bar`, though the parsed `fetch.Query` is actually stored in the `Input`. Type checking of an input occurs once the query has been executed, by way of type assertions inside the block.
The user should be able to set a value for any route. The value will be treated as satisfying that route, allowing the block to proceed when all other routes are satisfied. If a block has only one route, and that route's value is set, then the block will process that value continually until something downstream blocks it - effectively turning that block into a pusher block. Similarly if a block has multiple routes that are all set to values the block will operate on all those values together constantly.
In the background, the set values will appear to the block as messages that are always ready, by instantiating a small pusher goroutine against each route. The goroutine will run until a connection is made or until the value is changed. If the value changes a new goroutine will appear in its place.
The UI will allow the user to set the value as a JSON string; the API is responsible for parsing the string and sending it to the Input's `SetValue` method.
Internally to the block, values will be treated exactly the same as messages that came from another block. They still appear as `Message` objects on an inbound `Connection`. This keeps everything nice and simple.
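The pusher-goroutine mechanism described above can be sketched with Go channels. This is a minimal illustration under assumptions, not st-core's actual implementation: the `Message` type, the channel standing in for a `Connection`, and the `pusher` name are all stand-ins invented here.

```go
package main

import "fmt"

// Message is a placeholder for st-core's message type (assumption).
type Message interface{}

// pusher continually offers a fixed value on out until quit is closed,
// mimicking how a set route value appears as an always-ready message.
// When the value changes, the caller closes quit and starts a new pusher.
func pusher(value Message, out chan Message, quit chan struct{}) {
	for {
		select {
		case out <- value:
		case <-quit:
			return
		}
	}
}

func main() {
	out := make(chan Message)
	quit := make(chan struct{})
	go pusher(42, out, quit)

	// The block reads the route like any other inbound connection.
	for i := 0; i < 3; i++ {
		fmt.Println(<-out) // prints 42 each time
	}
	close(quit) // value changed or connection made: stop this pusher
}
```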
blocks: operate on data
stores: store data
streams: endpoints that events are delivered from or to
pollables: state is requested from elsewhere and turned into messages
This is to suggest how priority queues - an important language construct for dealing with streaming data - should appear in the language.
A priority queue is a store, and will live in memory in `core`.
Highest priority is the lowest number in the queue.
It will have associated blocks:
Sample patterns to come below.
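A minimal sketch of such an in-memory store, assuming Go's standard `container/heap` and taking the lowest number as the highest priority, as stated above. The `item`, `pqueue`, `insert`, and `next` names are invented here for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item pairs a message with its priority; the lowest number is the
// highest priority, per the proposal above.
type item struct {
	priority int
	msg      interface{}
}

type pqueue []item

func (q pqueue) Len() int            { return len(q) }
func (q pqueue) Less(i, j int) bool  { return q[i].priority < q[j].priority }
func (q pqueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *pqueue) Push(x interface{}) { *q = append(*q, x.(item)) }
func (q *pqueue) Pop() interface{} {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

// newPQ, insert, and next wrap container/heap for the associated blocks.
func newPQ() *pqueue                          { q := &pqueue{}; heap.Init(q); return q }
func (q *pqueue) insert(p int, m interface{}) { heap.Push(q, item{p, m}) }
func (q *pqueue) next() item                  { return heap.Pop(q).(item) }

func main() {
	q := newPQ()
	q.insert(5, "low")
	q.insert(1, "urgent")
	q.insert(3, "medium")
	fmt.Println(q.next().msg) // "urgent": lowest number comes out first
}
```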
a single writer can possibly be sent to multiple blocks at the same time. we need to double check that this is OK.
If errors are on a different output then
fix regression in #115
Each block has a field "Description", containing a short description of the block's function.
there is a memory race somewhere in the server with the websocket that sends state updates. it is causing failures with chrome, bytes out of order, frame problems, etc.
spending a lot of time trying to figure out what the hell it was I was doing. Need a way of leaving myself hints.
Groups should be unnamed until the user names them.
Sources should have their "route" be the name of the source, removing the need to name them, too.
need a palette of blocks, if for no other reason than I have completely forgotten which blocks we've made. I guess it should be generaliseable so we can paste new blocks into it, and they get stored somewhere for re-use.
/ : GET streamtools UI
/library : GET all possible blocks
/version : GET current version
/status : GET current status
/import : POST add collection of blocks
/export : GET receive
/blocks : GET lists all blocks; POST adds new block; DELETE clears all blocks
/blocks/{id} : GET gets block; PUT updates block; DELETE deletes block
/blocks/{id}/{route} : POST send message to route
/ws/{id} : output stream for block {id}
/log : ui/log websocket
/connections : GET lists all connections; POST creates connection
/connections/{id} : GET describes connection; DELETE deletes connection
'nuff said
A rather intense document that should cover how the following works, and how the following can work together:
There will be a core library with an absolute minimal number of blocks:
Additional libraries will exist in other repositories. Additional libraries can be imported by POSTing repo url to /library.
fix regression #115
A block that, when triggered, emits the current time as ms epoch
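A sketch of what that block's kernel might look like in Go; the `nowMillis` name is hypothetical, and this is just the core function, not a full block.

```go
package main

import (
	"fmt"
	"time"
)

// nowMillis is the kernel of the proposed block: when triggered,
// it emits the current time as milliseconds since the Unix epoch.
func nowMillis() int64 {
	return time.Now().UnixNano() / int64(time.Millisecond)
}

func main() {
	fmt.Println(nowMillis())
}
```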
the canvas rebuild needs to be finished
streamtools needs a new splash page at nytlabs.github.com/streamtools that reflects the new hotness
fix regression #115
currently, groups do not save translations and do not have relative coordinates when grouping/ungrouping.
the latter then the former
need a block to round a number to the nearest integer
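Such a block's kernel could be a thin wrapper over the standard library; a sketch, with the `round` name invented here (note `math.Round` rounds halves away from zero).

```go
package main

import (
	"fmt"
	"math"
)

// round is the kernel of the proposed block: round a number to the
// nearest integer (halves round away from zero, per math.Round).
func round(x float64) float64 {
	return math.Round(x)
}

func main() {
	fmt.Println(round(2.4), round(2.5), round(-2.5)) // 2 3 -3
}
```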
This issue is to describe those blocks we currently feel should be included in st-core. Each comment should describe a block or a group of related blocks.
Let's try and get some nice vocab together under one issue so we can resolve any strangenesses, unify as many concepts as we can, and get st-core's concepts into a really tight subset.
autocomplete box is dysfunctionally ugly. needs minimal styling.
there needs to be an example that details how information is put in and taken out of st-core
The following document intends to illustrate what Streamtools might look like if it was actually a language.
Every time we want to do some kind of operation that is not currently afforded by Streamtools we need to write Go, recompile, and rerelease. This is especially annoying for situations where we are limited by the grammar of ST. The lack of modularity also impacts people who may wish to curate their own work environment with blocks that are specific to their work.
There are several problems with how ST currently uses flow for control.
The inconsistent block API is a product of inconsistent use of flow for control as well as inconsistent access to state. For instance, the `cache`, `set`, and `histogram` blocks should be very closely coupled. They are also lacking, in different ways, many operations we expect from a key-value data structure.
Diagram 1
The above diagram represents a system where two messages enter the `+` block to produce a sum. This layout is impossible in the current instantiation of ST without some kind of workaround, since it is impossible for the 2 input streams to ever arrive at the same time. An example workaround would be to associate the message containing '1' with the message containing '2', with a delay to wait for the message that has yet to arrive. Another workaround would be something like our `join` block, where we assume order is correct.
This is undesirable, as it introduces all kinds of issues of synchronicity in the stream or unnecessary metadata embedded in each message. For instance, we could associate the wrong messages with one another or we could require a whole new system to deal with metadata in each message to resolve how the messages should be paired. Either way, this is added complexity that does not help in the development/understanding of a grammar consistent across blocks. This leads us to our first proposal:
This means that blocks like `timeseries`, which seem to allow asynchronous access (the pushing to the timeseries and the polling of a timeseries), are no longer valid. The funny thing is that this rule actually isn't changing how blocks currently work internally: currently each block does one operation at a time; all this idea does is express the synchronous aspect already internal to the block to the input routes.
What the above means is that all input routes block until all input routes have received a message. Once all routes have been satisfied the operation of the block is triggered.
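This rule can be sketched with Go channels standing in for input routes. A minimal illustration of the proposal, not st-core code: reading from both channels before computing is exactly "all routes must be satisfied before the operation fires".

```go
package main

import "fmt"

// add models the rule above: the block's operation fires only once every
// input route has received a message.
func add(left, right <-chan float64, out chan<- float64) {
	for {
		a, ok := <-left // blocks until the left route is satisfied
		if !ok {
			close(out)
			return
		}
		b := <-right // blocks until the right route is satisfied
		out <- a + b // only now is the operation triggered
	}
}

func main() {
	l, r, out := make(chan float64), make(chan float64), make(chan float64)
	go add(l, r, out)
	go func() { l <- 1; r <- 2 }()
	fmt.Println(<-out) // 3
}
```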
The above suggestion is a bit at odds with our current instantiation of the idea of "rules", which are used to give parameters to a block. Many of our blocks have "rules" that persist throughout the duration of the block's life and never need to be changed. Which leads us to...
Diagram 2
Since our core blocks depend on having all inputs satisfied once per block operation, this presents a problem when we have situations where we'd like a constant value to be applied as a parameter for a given block. We don't want to send one message per message in the above addition example if we want to add a constant to a stream.
The ability to configure how a route accepts and treats its value fixes this problem. In diagram 2, two modes of configuration are presented. The dialog on top is for a situation where a block is meant to accept a stream and retrieve a specific value from within an input object. In the case of the diagram, the route is configured to accept the number for the `+` block from {foo:{bar:{baz:}}}. This is a "Path" route.
The `preserve value` checkbox is meant to cover the case where a block needs to be modified out-of-step with the other input stream. This means that if you have the left input stream running at 50hz, and you modify the parameter for the right stream at 1hz, the value parameter is preserved for all messages until the right stream is updated again.
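A sketch of that preserve-value behaviour, assuming Go channels for the two routes; the `addWithPreservedParam` name is invented here. The parameter is held and reused for every message until a new one arrives, so the two streams may run at different rates.

```go
package main

import "fmt"

// addWithPreservedParam holds the right-hand parameter across messages:
// each left-hand message uses the most recent parameter value.
func addWithPreservedParam(msgs <-chan float64, params <-chan float64, out chan<- float64) {
	param := <-params // wait for an initial parameter
	for m := range msgs {
		select {
		case param = <-params: // out-of-step update, if one is pending
		default: // otherwise keep the preserved value
		}
		out <- m + param
	}
	close(out)
}

func main() {
	msgs := make(chan float64, 3)
	params := make(chan float64, 1)
	out := make(chan float64, 3)
	params <- 10
	msgs <- 1
	msgs <- 2
	msgs <- 3
	close(msgs)
	go addWithPreservedParam(msgs, params, out)
	for v := range out {
		fmt.Println(v) // 11, 12, 13
	}
}
```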
The bottom dialog in Diagram 2 presents an option to input JSON directly into the right stream input, so that it is always constant across operations. This is a "Constant" route. This provides us with 3 types of patterns:
Diagram 3
The middle pattern in Diagram 3 presents a problem: how can we guarantee two streams appear at that block in a way that is useful to us?
Diagram 4
Diagram 4 introduces some new blocks:
`source`: let's consider this some magical place where messages come from.

`unpack`: given the key-path configured in the input route, emit 1 message per array element found at that key-path location on the left output stream. Before starting a stream of elements, it sends the number of elements that it will be unpacking out its right output.

`pack`: pack accepts the length of elements expected to be packed into an array on the right input. Once a value is accepted, the right input blocks until the left input has satisfied the correct length. Upon reaching the length, the array is emitted, and the right input becomes unblocked.

`set`: set accepts a JSON object on the right input. This JSON object serves as a template into which the values (left input) are inserted. If the right input contains the value {foo: ""} (the value can be anything, it will be overwritten), every message emitted will be in the format {foo: value}.

`merge`: like `+`, merge requires an input on both left input and right input to commit an operation. `merge` combines two maps, privileging the right input.

This pattern takes advantage of blocking in order to maintain consistency across events. When a message is broadcast from `source`, it is immediately sent to `merge`, which blocks `source` from sending any more messages into the system. Next, a message is sent to `unpack`, which immediately sends a message to `pack`. `pack` then blocks until the array length is satisfied, blocking `unpack` from sending any more messages downstream. Each message is added to 1 and sent to `pack`. Once `pack` emits the array, `unpack` is unblocked, and the message is `set` into an object. This object is sent to `merge`, which unblocks `source`.
NOTE: The right input for `pack` may seem at odds with proposals 1 & 2. Instead of having a 1:1 relationship with its fellow left input, it has a 1:N relationship. Considering this block as a portal between many messages and a single message, I feel less icky about the situation.
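The pack behaviour described above can be sketched in Go; an illustrative model only, with the `pack` function signature invented here. The right input supplies the expected length, then the left input is drained until that many elements have arrived.

```go
package main

import "fmt"

// pack models the pack block: for each length received on the right
// input, collect that many elements from the left input, then emit the
// array and become unblocked for the next length.
func pack(length <-chan int, elements <-chan interface{}, out chan<- []interface{}) {
	for n := range length {
		arr := make([]interface{}, 0, n)
		for len(arr) < n {
			arr = append(arr, <-elements) // blocks until satisfied
		}
		out <- arr
	}
	close(out)
}

func main() {
	lengths := make(chan int, 1)
	elems := make(chan interface{}, 3)
	out := make(chan []interface{}, 1)
	lengths <- 3
	close(lengths)
	elems <- 1
	elems <- 2
	elems <- 3
	go pack(lengths, elems, out)
	fmt.Println(<-out) // [1 2 3]
}
```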
Diagram 5
One of the deficiencies in our current implementation is that complex operations with state are rather difficult. This is because we can only access the state via one block -- a block that has one output. Determining what the output means in regards to the state can be rather tricky -- and has resulted in the inclusion of metadata in our outputs.
One way to amend the current situation is to afford multiple outputs that signal various circumstances. Similar to the `unpack` block in Diagram 4, various output routes can be added to afford the control of flow. For example, if a lookup query returns no result when burying a set, then instead of returning null (which is valid JSON that should be able to be included in a set) or nothing (which is unhelpful), we should send a signal out of a separate output channel, signaling that no message was found.
In addition to multiple outputs, another way to fix the problem of having to worry about how many different state queries work together is to divorce the state from the block. This amounts to having a completely new element in ST -- a data store.
The data store does not participate in message flow. It is a node that is associated with various blocks that act as an API to its contents. Diagram 5 illustrates what some of these blocks may look like. The "association" is illustrated as a route that exists on the side -- this is for illustration purposes only. There is no directionality when associating a block with a store.
This affords us the ability to run multiple control flows with a single data store and affords the construction of complex patterns.
Diagram 6
The above diagram illustrates a timeseries. Values come in through `source`; `source` broadcasts one message to `set`, which puts the value into an object, like {"value":22}. `source` also sends a message to `now`, which produces a time in epoch ms. This epoch ms is also set into an object, like {"time": 140092818110000}. The two are merged, and then pushed to an array. `push` returns the new length of the array, and if the new length of the array is greater than our sample size (60), then pop an element off the array. If at any time we would like to view the timeseries, we send a message to `dump`, which does not interfere with our push/pop flow at all.
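The push/pop flow above can be condensed into a small Go sketch; a model of the pattern, not of st-core's actual store, with the `timeseries` type invented here.

```go
package main

import "fmt"

// timeseries models the pattern above: push appends a sample and, when
// the length exceeds the sample size, pops the oldest one.
type timeseries struct {
	samples []map[string]interface{}
	size    int
}

// push returns the new length, after trimming to the sample size.
func (t *timeseries) push(s map[string]interface{}) int {
	t.samples = append(t.samples, s)
	if len(t.samples) > t.size {
		t.samples = t.samples[1:] // pop the oldest element
	}
	return len(t.samples)
}

// dump returns the current contents without disturbing the push/pop flow.
func (t *timeseries) dump() []map[string]interface{} {
	out := make([]map[string]interface{}, len(t.samples))
	copy(out, t.samples)
	return out
}

func main() {
	ts := &timeseries{size: 60}
	for i := 0; i < 100; i++ {
		ts.push(map[string]interface{}{"time": i, "value": i * i})
	}
	fmt.Println(len(ts.dump())) // 60: only the most recent samples remain
}
```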
Diagram 7
Building blocks is still not something I've entirely figured out. The above diagram displays how a `map` would work. It's a lot more work, but the data is much more clearly expressed. One thing you'll see is that `merge` has 3 inputs. I am of the mind that `merge` should be a magic block that can allow infinite inputs. I imagine this UI-wise as just dragging the block wider.
Here are some scattered notes on how I think custom blocks should work:
Errata:
No more DSLs, for anything. All key paths are represented as JSON -- no go-fetch. I realized this at the very end here, but there is no reason to represent a key-path as a string. Just use a JSON object. Just imagine Diagram 2 has {foo:{bar:{baz:""}}} instead of .foo.bar.baz.
A route's:
should be queryable.
when importing a pattern, the entire pattern should remain in a "stopped" state until the pattern has finished its import.
when panning inside a group, a block is created at a position not relative to the translated space. it should appear wherever the autocomplete box was invoked, taking into account the group translation.
For the life of me I can't think how to draw the cache block (or set or histogram or count etc) as a marble diagram (as in https://gist.github.com/staltz/868e7e9bc2a7b8c1f754). But, I reckon I could draw some diagrams that would describe a mechanism that implements the flow modification of a cache (if not the subsequent feedback):
I'm wondering then if it makes sense to draw the diagram first, then write the block.
A kernel is the core function of a block. It is a function that processes a set of inputs to produce a set of outputs. It is escapable and alerts on error.
A typical block contains a single kernel, however, a user may be able to compose a set of kernels into a single block.
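One way to sketch this composition in Go, under the assumption that a kernel is simply a function from an input set to an output set that can fail; the `kernel` type and `compose` helper are invented here.

```go
package main

import "fmt"

// kernel: a function that processes a set of inputs to produce a set of
// outputs. It is escapable and alerts on error, per the definition above.
type kernel func(in map[string]interface{}) (map[string]interface{}, error)

// compose chains kernels so the outputs of one feed the inputs of the
// next, forming a single block out of several kernels.
func compose(ks ...kernel) kernel {
	return func(in map[string]interface{}) (map[string]interface{}, error) {
		var err error
		for _, k := range ks {
			in, err = k(in)
			if err != nil {
				return nil, err // escape: alert on error
			}
		}
		return in, nil
	}
}

// two toy kernels for demonstration
var double kernel = func(in map[string]interface{}) (map[string]interface{}, error) {
	return map[string]interface{}{"x": in["x"].(float64) * 2}, nil
}
var addOne kernel = func(in map[string]interface{}) (map[string]interface{}, error) {
	return map[string]interface{}{"x": in["x"].(float64) + 1}, nil
}

func main() {
	block := compose(double, addOne) // a block composed of two kernels
	out, _ := block(map[string]interface{}{"x": 3.0})
	fmt.Println(out["x"]) // 7
}
```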
A block composed of multiple kernels:
This is different than what is proposed in the language spec (#8). That document proposes "blocks" that can be composed of multiple blocks -- however it does not say anything about composing a block's function. Because of the need to compose groups of blocks together, many of which may contain cycles, I propose that those are called something different -- like "groups"
A group:
We need to be able to release binaries for all platforms
a state is a marshalable variable that lives inside an instantiated block.
a block maintains multiple states.
each inbound route has an associated callback function.
each callback function can only see one input function.
each callback function can see all the states and all the outbound routes.
each state has an auto-generated accessor
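The notes above can be sketched as a Go struct; a loose model under assumptions, with the `block` type, `newCounter` constructor, and `State` accessor all invented for illustration.

```go
package main

import "fmt"

// block models the notes above: marshalable state lives inside the
// instantiated block, each inbound route has a callback, and callbacks
// can see all the states and all the outbound routes.
type block struct {
	states    map[string]interface{}           // marshalable state
	callbacks map[string]func(msg interface{}) // one callback per inbound route
	outbound  map[string]chan interface{}      // outbound routes
}

// State plays the role of an auto-generated accessor for a named state.
func (b *block) State(name string) interface{} { return b.states[name] }

// newCounter builds a block whose single "in" route callback mutates
// state and forwards the message downstream.
func newCounter() *block {
	b := &block{
		states:    map[string]interface{}{"count": 0},
		callbacks: map[string]func(interface{}){},
		outbound:  map[string]chan interface{}{"out": make(chan interface{}, 8)},
	}
	b.callbacks["in"] = func(msg interface{}) {
		b.states["count"] = b.states["count"].(int) + 1 // touch state
		b.outbound["out"] <- msg                        // touch an outbound route
	}
	return b
}

func main() {
	b := newCounter()
	b.callbacks["in"]("hello")
	fmt.Println(b.State("count"), <-b.outbound["out"]) // 1 hello
}
```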
fix regression from #115
because that would be lovely!
Code is here:
https://github.com/mikedewar/st-core/blob/master/thinking/api/main.go
The following test code:

```
curl localhost:7071/block -d'{"name":"A"}'
curl localhost:7071/block -d'{"name":"B"}'
curl localhost:7071/block -d'{"name":"C"}'
printf "\n"
printf "\n"
curl localhost:7071
printf "\n"
curl localhost:7071/group -d'{"ParentID":0, "ChildIDs":[1,2]}'
printf "\n"
printf "\n"
curl localhost:7071
```

returns

```
OKOKOK
0 -
1
2
3
OK
0 -
4-
1
2
3
```
No more query routes and rules.
Instead, every block has a number of inputs, a number of outputs, and a number of states.
Inputs
The only parameter an input has is path. This allows a user to specify where in the incoming JSON object to look. This path parameter is a simplified version of gojee, such that it always returns a single value, or it returns an error if a value is not found. It removes the capability of doing a query across array elements, such that `.foo[].bar` is no longer valid (`.foo[0].bar` is still valid, however).
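The simplified single-value behaviour could look roughly like the following sketch, which walks a decoded JSON value by map keys and array indices. Illustrative only; the `fetchPath` function is invented here and the real gojee/go-fetch APIs differ.

```go
package main

import "fmt"

// fetchPath walks a decoded JSON value by map keys and array indices,
// returning a single value or an error if nothing is found -- the
// behaviour proposed for the simplified path parameter.
func fetchPath(v interface{}, keys ...interface{}) (interface{}, error) {
	for _, k := range keys {
		switch kk := k.(type) {
		case string:
			m, ok := v.(map[string]interface{})
			if !ok {
				return nil, fmt.Errorf("not an object at %v", kk)
			}
			if v, ok = m[kk]; !ok {
				return nil, fmt.Errorf("no key %q", kk)
			}
		case int:
			a, ok := v.([]interface{})
			if !ok || kk < 0 || kk >= len(a) {
				return nil, fmt.Errorf("bad index %d", kk)
			}
			v = a[kk]
		}
	}
	return v, nil
}

func main() {
	// equivalent of the path .foo[0].bar
	doc := map[string]interface{}{
		"foo": []interface{}{map[string]interface{}{"bar": 42.0}},
	}
	v, err := fetchPath(doc, "foo", 0, "bar")
	fmt.Println(v, err) // 42 <nil>
}
```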
Instead of rules, there are a number of inputs. For instance: a count block would have a `msg` input and a `window` input. The `window` input does not accept rules; it accepts a stream of parameters that are used to affect the state of the block. These parameters would look like a stream of strings, e.g. `30s`.
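A parameter string like `30s` maps directly onto Go's standard duration parsing; a sketch of how the `window` input might consume it (the `setWindow` name is hypothetical).

```go
package main

import (
	"fmt"
	"time"
)

// setWindow sketches how a count block's window input might turn a
// parameter string like "30s" into state, using time.ParseDuration.
func setWindow(param string) (time.Duration, error) {
	return time.ParseDuration(param)
}

func main() {
	w, err := setWindow("30s")
	fmt.Println(w, err) // 30s <nil>
}
```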
(depends on #3)
State
Each block can have a number of states. These states can optionally have a pollable input pin/output pin emitter as well as an HTTP endpoint. The benefit of having a state over our current implementation is that it combines the production of a message for an HTTP endpoint with production for the rest of the ST system, meaning a block author does not need to write multiple bits of code that do the same thing.
Outputs
Instead of having a single output, a block can have multiple outputs. These should be able to be named.
fix regression #115
if you double click on a group, it should show you what's inside
since the reboot, blocks don't have names. It would be ace to be able to see which blocks are which.