Cadence is a distributed, scalable, durable, and highly available orchestration engine for executing asynchronous, long-running business logic in a resilient way.
Currently, mutable state is only used for a small part of the API. This work item tracks extending it to all API calls on the History service and keeping it updated with all relevant information, such as:
ActivityInfos
TimerInfos
OutstandingDecision
NextEventID
ChildWorkflows
Potentially any Signal ID if it makes sense for any API
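To make the shape of this concrete, here is a rough sketch of what a consolidated mutable-state record covering the items above might look like. The type and field names are illustrative assumptions, not the actual History service schema.

```go
package main

import "fmt"

// Hypothetical sketch of a consolidated mutable-state record; the real
// field names and types in the History service may differ.
type ActivityInfo struct {
	ScheduleID int64
	StartedID  int64
}

type TimerInfo struct {
	TimerID    string
	ExpiryTime int64
}

type WorkflowMutableState struct {
	ActivityInfos       map[int64]*ActivityInfo // keyed by schedule event ID
	TimerInfos          map[string]*TimerInfo   // keyed by user timer ID
	OutstandingDecision int64                   // schedule ID of the pending decision, if any
	NextEventID         int64
	ChildWorkflows      map[int64]string // initiated event ID -> child run ID
	SignalIDs           map[string]bool  // de-dupe set of delivered signal IDs
}

func main() {
	ms := &WorkflowMutableState{
		ActivityInfos: map[int64]*ActivityInfo{},
		NextEventID:   3,
	}
	ms.ActivityInfos[2] = &ActivityInfo{ScheduleID: 2}
	fmt.Println(len(ms.ActivityInfos), ms.NextEventID) // 1 3
}
```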
The frontend service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We see issues where requests that reach the frontend before it is properly initialized cause panics.
We need to handle this similar to the way we handle the History and Matching services, where we block incoming requests on a wait group until service initialization is complete.
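The gating pattern described above can be sketched as follows. The handler holds a wait group open until `Start()` finishes, and every RPC entry point waits on it first. Type and method names here are illustrative, not the actual Cadence frontend code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Sketch of gating RPC handlers on service initialization, mirroring
// the approach used for the History and Matching services. Names are
// illustrative, not the actual Cadence types.
type FrontendHandler struct {
	startWG sync.WaitGroup // held open until Start() completes
}

func NewFrontendHandler() *FrontendHandler {
	h := &FrontendHandler{}
	h.startWG.Add(1) // block requests until initialization completes
	return h
}

func (h *FrontendHandler) Start() {
	// ... initialize dependencies (history client, metadata, etc.) ...
	time.Sleep(10 * time.Millisecond)
	h.startWG.Done() // open the gate
}

// Every RPC entry point waits for initialization before touching state.
func (h *FrontendHandler) StartWorkflowExecution() string {
	h.startWG.Wait()
	return "started"
}

func main() {
	h := NewFrontendHandler()
	go h.Start()
	fmt.Println(h.StartWorkflowExecution()) // blocks until Start() is done
}
```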
Every TaskList is mapped to a single Cassandra partition, so if all shards write events to a single TaskList, it becomes a scalability bottleneck for the system. If sync matching is not happening, we end up writing all tasks to Cassandra, and under heavy load Cassandra transactions start timing out. This behavior ends up generating a very large number of duplicate tasks.
I think we need to put a rate limiter on each TaskList to prevent this situation from happening. We should just return a throttle error back to the client and have the client back off and retry failures. This should cause the system to degrade gracefully under extreme load.
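One way to sketch this is a per-TaskList token bucket consulted on every add-task path; when the bucket is empty, a throttle error is returned for the client to back off on. The error type and rate values are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrThrottled is a stand-in for the throttle error the client would
// back off and retry on; the real error type is an assumption here.
var ErrThrottled = errors.New("task list rate limit exceeded, retry later")

// tokenBucket is a minimal per-TaskList limiter: refill at `rate`
// tokens/sec up to `burst`, reject AddTask when empty.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64
	last   time.Time
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, burst: burst, rate: rate, last: time.Now()}
}

func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func AddTask(limiter *tokenBucket) error {
	if !limiter.Allow() {
		return ErrThrottled // caller backs off and retries
	}
	// ... attempt sync match, else persist the task to Cassandra ...
	return nil
}

func main() {
	limiter := newTokenBucket(100, 2) // 100 adds/sec, burst of 2
	fmt.Println(AddTask(limiter), AddTask(limiter)) // both allowed
	fmt.Println(AddTask(limiter))                   // throttled: bucket empty
}
```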
Currently, timers are created for every activity and decision task. We need to implement logic that creates a single timer per workflow execution and arms the next earliest timer when that one fires.
This is for monitoring purposes and CLI-style scenarios. It allows us to get call stacks, debug stuck executions, etc., without hosting a decider implementation.
We have an issue where, if we get a timeout error while updating the workflow mutable state, we cannot guarantee that we read the correct, latest state on reload. This is because the write could still be applied after the read executes.
This could have led to corrupting the Events table if we tried to use the stale next_event_id value for subsequent writes.
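One way to defend against this is to make every event append conditional on the next_event_id the writer read, so a write replayed after a timed-out update cannot land on top of a stale read. The sketch below uses an in-memory stand-in for what would be a Cassandra lightweight transaction (`IF next_event_id = ?`); names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of guarding event appends with a compare-and-set on
// next_event_id. The store is an in-memory stand-in for a Cassandra
// lightweight transaction; names are illustrative.
var ErrConditionFailed = errors.New("condition failed: stale next_event_id")

type executionRow struct {
	nextEventID int64
}

// conditionalUpdate applies newNextEventID only if the row still holds
// the value the caller read; otherwise the caller must reload state.
func (r *executionRow) conditionalUpdate(readNextEventID, newNextEventID int64) error {
	if r.nextEventID != readNextEventID {
		return ErrConditionFailed
	}
	r.nextEventID = newNextEventID
	return nil
}

func main() {
	row := &executionRow{nextEventID: 5}
	// A writer read nextEventID=5 and appends two events.
	fmt.Println(row.conditionalUpdate(5, 7)) // succeeds
	// A delayed retry of the same write, based on the stale read, is rejected.
	fmt.Println(row.conditionalUpdate(5, 7)) // condition failed
}
```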
This is to enable giving higher priority to tasks for outstanding workflows rather than newer ones, so we can complete outstanding workflows faster in the event of a backlog.
We mark the workflow execution row with a TTL in the executions table on completion. This takes care of the workflow execution entry in the executions table, but we still leak space in the events table because we don't clean up the history associated with that execution.
We could use the timer queue processor for this purpose and queue a timer task to delete the execution history after the retention period.
Cadence is a multi-tenant service, and we need to protect against a single bad user bringing the entire system down. This task is to implement basic throttling and quotas for each client.
If RespondDecisionTaskCompleted sends in a bad request or corrupted data, we just silently ignore the ScheduleActivityTask decisions. Instead, we need to add a relevant failure event, like ActivityTaskScheduleFailed, and then also create a new DecisionTask for the decider. Here is an instance of the failure:
{"RunID":"c09c5b10-d240-4f8b-bc4c-5735c0bb3805","ScheduleID":212,"Service":"cadence-frontend","WorkflowID":"48018f57-0c39-4d4e-b055-e3df3fff7464","level":"error","msg":"RespondDecisionTaskCompleted. Error: BadRequestError({Message:Missing StartToCloseTimeoutSeconds in the activity scheduling parameters.})","time":"2017-03-07T13:56:56-08:00"}
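The fix could look something like the sketch below: validate the decision attributes, and on failure append a failure event plus a new decision task instead of dropping the decision. Event and field names are illustrative, not the exact Cadence schema.

```go
package main

import "fmt"

// Sketch of surfacing invalid ScheduleActivityTask decisions instead of
// silently dropping them: validate the attributes, append a failure
// event, and schedule a new decision task so the decider can react.
// Names are illustrative, not the exact Cadence schema.
type scheduleActivityAttributes struct {
	ActivityID                 string
	StartToCloseTimeoutSeconds int32
}

type historyEvent struct {
	eventType string
	cause     string
}

func handleScheduleActivityDecision(attr scheduleActivityAttributes) []historyEvent {
	if attr.StartToCloseTimeoutSeconds <= 0 {
		return []historyEvent{
			{"ActivityTaskScheduleFailed", "Missing StartToCloseTimeoutSeconds in the activity scheduling parameters."},
			{"DecisionTaskScheduled", ""}, // give the decider a chance to handle the failure
		}
	}
	return []historyEvent{{"ActivityTaskScheduled", ""}}
}

func main() {
	events := handleScheduleActivityDecision(scheduleActivityAttributes{ActivityID: "a1"})
	for _, e := range events {
		fmt.Println(e.eventType)
	}
	// ActivityTaskScheduleFailed
	// DecisionTaskScheduled
}
```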
This is very useful for supporting scenarios like storing config as a custom event when the workflow execution is started. This will allow users to make configuration changes without breaking running instances.
The History service seems to be creating a timeout task on each heartbeat. Instead, we should record the last heartbeat time in mutable state and only create a new timeout task when the existing one expires, based on the last recorded heartbeat time.
When the decider responds with a complete-workflow decision, we first update the execution with the new events and then delete the workflow execution in a separate transaction. This can cause issues when the update times out but is actually applied successfully; in that case we may never delete the workflow execution.
We need to make sure the execution is updated and deleted in the same transaction.
Certain workflows are easier to write if mutable state, instead of history, is exposed directly to the client for making decisions. Workflows like cron would prefer this model, and it is much more optimized for such scenarios. Also, using mutable state for things like activity retries is much preferable to having the client implement the retry logic.
We now have support for returning the correct host information when API calls to the history service fail with ShardOwnershipLostError.
The history client needs to inspect the ShardOwnershipLostError and retry the request against the host identified in the error.
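A sketch of the client-side redirect loop: on ShardOwnershipLostError, switch to the owner host carried in the error and retry, bounded by a redirect limit. The types here are simplified stand-ins for the real ones.

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of the client-side redirect: when a history call fails with
// ShardOwnershipLostError, re-resolve to the owner host carried in the
// error and retry. Types are simplified stand-ins for the real ones.
type ShardOwnershipLostError struct {
	Owner string // host that currently owns the shard
}

func (e *ShardOwnershipLostError) Error() string {
	return "shard ownership lost, owner: " + e.Owner
}

func call(host string) error {
	if host != "10.0.0.2:7934" { // only this host owns the shard in the sketch
		return &ShardOwnershipLostError{Owner: "10.0.0.2:7934"}
	}
	return nil
}

// callWithRedirect retries once per ownership-lost redirect, up to maxRedirects.
func callWithRedirect(host string, maxRedirects int) (string, error) {
	for i := 0; i <= maxRedirects; i++ {
		err := call(host)
		if err == nil {
			return host, nil
		}
		var solErr *ShardOwnershipLostError
		if !errors.As(err, &solErr) {
			return "", err // not retriable here
		}
		host = solErr.Owner // redirect to the reported owner
	}
	return "", errors.New("exceeded redirect limit")
}

func main() {
	host, err := callWithRedirect("10.0.0.1:7934", 2)
	fmt.Println(host, err) // 10.0.0.2:7934 <nil>
}
```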
Currently, Cadence supports dedupe on workflow ID only while the execution is still running. There are scenarios where workflows are fast-running and complete immediately, so it would be very useful to also support dedupe on workflow ID for completed executions.
The matching service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We see issues where requests that reach the matching engine before it is properly initialized cause panics.
We need to handle this the same way as the History service, where we block incoming requests on a wait group until service initialization is complete.
By design, the matching engine can lose tasks even before recording in the execution history that they started. This is OK for activity tasks, since they always have timeouts.
On the other hand, there is no ScheduleToStart timeout for decision tasks (to avoid unnecessary timeouts in case the decider is down or not polling for tasks). If a decision task is lost, the workflow execution will get stuck forever.
Right now, every request gets a WorkflowExecutionContext from the cache and then acquires a lock on that object. It is possible in edge conditions that two requests end up with two different context objects (request 1 gets the context, the context gets evicted from the cache, then request 2 creates a new object). This will break the guarantee that only one write per execution originates from the history engine at a time.
We can fix this by having a central lock manager that grants locks on executions instead of locking the context object itself.
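A minimal sketch of such a lock manager, keyed by execution, so two requests racing on the same execution always serialize on the same mutex even if the cached context was evicted in between. A real implementation would also reference-count entries so unused locks can be released; names here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of a central lock manager keyed by execution, so serialization
// no longer depends on the (evictable) cached context object.
type lockManager struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // keyed by workflowID+runID
}

func newLockManager() *lockManager {
	return &lockManager{locks: map[string]*sync.Mutex{}}
}

// LockExecution returns the one mutex for this execution, creating it
// on first use. Two requests for the same execution always get the
// same mutex, even if the cached context was evicted in between.
func (m *lockManager) LockExecution(key string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.locks[key]; !ok {
		m.locks[key] = &sync.Mutex{}
	}
	return m.locks[key]
}

func main() {
	mgr := newLockManager()
	a := mgr.LockExecution("wf-1/run-1")
	b := mgr.LockExecution("wf-1/run-1")
	fmt.Println(a == b) // true: both requests serialize on the same lock
	a.Lock()
	// ... load or rebuild the WorkflowExecutionContext, do the write ...
	a.Unlock()
}
```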
This feature is to support restarting workflows from a given point in the workflow execution history. Basically, you want to preserve the history of an execution up to a point and restart from that location. This is very useful when a workflow fails due to a bug at a certain point and you want to restart the workflow after fixing the bug.
Ideally, execution history should never get corrupted. If, for any reason (bugs?), we get into a state where this happens, we should not just return a retriable error to the callers.