Cadence is a distributed, scalable, durable, and highly available orchestration engine for executing asynchronous, long-running business logic in a resilient way.
Currently, mutable state is only used for a small part of the API. This work item tracks extending it to all API calls on the History service and keeping it updated with all relevant information, such as:
ActivityInfos
TimerInfos
OutstandingDecision
NextEventID
ChildWorkflows
Potentially any Signal ID if it makes sense for any API
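To make the shape of this concrete, here is a rough sketch of what a consolidated mutable-state record covering the items above might look like. The type and field names are illustrative assumptions, not the actual History service schema.

```go
package main

import "fmt"

// Hypothetical sketch of a consolidated mutable-state record; the real
// field names and types in the History service may differ.
type ActivityInfo struct {
	ScheduleID int64
	StartedID  int64
}

type TimerInfo struct {
	TimerID    string
	ExpiryTime int64
}

type WorkflowMutableState struct {
	ActivityInfos       map[int64]*ActivityInfo // keyed by schedule event ID
	TimerInfos          map[string]*TimerInfo   // keyed by user timer ID
	OutstandingDecision int64                   // schedule ID of the pending decision, if any
	NextEventID         int64
	ChildWorkflows      map[int64]string // initiated event ID -> child run ID
	SignalIDs           map[string]bool  // de-dupe set of delivered signal IDs
}

func main() {
	ms := &WorkflowMutableState{
		ActivityInfos: map[int64]*ActivityInfo{},
		NextEventID:   3,
	}
	ms.ActivityInfos[2] = &ActivityInfo{ScheduleID: 2}
	fmt.Println(len(ms.ActivityInfos), ms.NextEventID) // 1 3
}
```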
The frontend service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We see issues where requests that reach the frontend before it is properly initialized cause panics.
We need to handle this similar to the way we handle the History and Matching services, where we block incoming requests on a wait group until service initialization is complete.
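The gating pattern described above can be sketched as follows. The handler holds a wait group open until `Start()` finishes, and every RPC entry point waits on it first. Type and method names here are illustrative, not the actual Cadence frontend code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Sketch of gating RPC handlers on service initialization, mirroring
// the approach used for the History and Matching services. Names are
// illustrative, not the actual Cadence types.
type FrontendHandler struct {
	startWG sync.WaitGroup // held open until Start() completes
}

func NewFrontendHandler() *FrontendHandler {
	h := &FrontendHandler{}
	h.startWG.Add(1) // block requests until initialization completes
	return h
}

func (h *FrontendHandler) Start() {
	// ... initialize dependencies (history client, metadata, etc.) ...
	time.Sleep(10 * time.Millisecond)
	h.startWG.Done() // open the gate
}

// Every RPC entry point waits for initialization before touching state.
func (h *FrontendHandler) StartWorkflowExecution() string {
	h.startWG.Wait()
	return "started"
}

func main() {
	h := NewFrontendHandler()
	go h.Start()
	fmt.Println(h.StartWorkflowExecution()) // blocks until Start() is done
}
```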
Every TaskList is mapped to a single Cassandra partition, so if all shards write events to a single TaskList, it becomes a scalability bottleneck for the system. If sync matching is not happening, we end up writing all tasks to Cassandra, and under heavy load Cassandra transactions start timing out. This behavior ends up generating a very large number of duplicate tasks.
I think we need to put a rate limiter on each TaskList to prevent this situation from happening. We should just return a throttle error back to the client and have the client back off and retry failures. This should cause the system to degrade gracefully under extreme load.
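One way to sketch this is a per-TaskList token bucket consulted on every add-task path; when the bucket is empty, a throttle error is returned for the client to back off on. The error type and rate values are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrThrottled is a stand-in for the throttle error the client would
// back off and retry on; the real error type is an assumption here.
var ErrThrottled = errors.New("task list rate limit exceeded, retry later")

// tokenBucket is a minimal per-TaskList limiter: refill at `rate`
// tokens/sec up to `burst`, reject AddTask when empty.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64
	last   time.Time
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, burst: burst, rate: rate, last: time.Now()}
}

func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func AddTask(limiter *tokenBucket) error {
	if !limiter.Allow() {
		return ErrThrottled // caller backs off and retries
	}
	// ... attempt sync match, else persist the task to Cassandra ...
	return nil
}

func main() {
	limiter := newTokenBucket(100, 2) // 100 adds/sec, burst of 2
	fmt.Println(AddTask(limiter), AddTask(limiter)) // both allowed
	fmt.Println(AddTask(limiter))                   // throttled: bucket empty
}
```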
Currently, timers are created for every activity and decision task. We need to implement logic that creates a single timer per workflow execution and arms the next earliest timer when that one fires.
This is for monitoring purposes and CLI-style scenarios. It allows us to get call stacks, debug stuck executions, etc., without hosting a decider implementation.
We have an issue where, if we get a timeout error while updating the workflow mutable state, we cannot guarantee that we read the correct, latest state on reload. This is because the write could still be applied after the read executes.
This could have led to corrupting the Events table if we tried to use the stale next_event_id value for subsequent writes.
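One way to defend against this is to make every event append conditional on the next_event_id the writer read, so a write replayed after a timed-out update cannot land on top of a stale read. The sketch below uses an in-memory stand-in for what would be a Cassandra lightweight transaction (`IF next_event_id = ?`); names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of guarding event appends with a compare-and-set on
// next_event_id. The store is an in-memory stand-in for a Cassandra
// lightweight transaction; names are illustrative.
var ErrConditionFailed = errors.New("condition failed: stale next_event_id")

type executionRow struct {
	nextEventID int64
}

// conditionalUpdate applies newNextEventID only if the row still holds
// the value the caller read; otherwise the caller must reload state.
func (r *executionRow) conditionalUpdate(readNextEventID, newNextEventID int64) error {
	if r.nextEventID != readNextEventID {
		return ErrConditionFailed
	}
	r.nextEventID = newNextEventID
	return nil
}

func main() {
	row := &executionRow{nextEventID: 5}
	// A writer read nextEventID=5 and appends two events.
	fmt.Println(row.conditionalUpdate(5, 7)) // succeeds
	// A delayed retry of the same write, based on the stale read, is rejected.
	fmt.Println(row.conditionalUpdate(5, 7)) // condition failed
}
```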
This is to enable giving higher priority to tasks for outstanding workflows rather than newer ones, so we can complete outstanding workflows faster in the event of a backlog.
We mark the workflow execution row with a TTL in the executions table on completion. This takes care of the workflow execution entry in the executions table, but we still leak space in the events table because we don't clean up the history associated with that execution.
We could use the timer queue processor for this purpose and queue a timer task to delete the execution history after the retention period.
Cadence is a multi-tenant service, and we need to protect against a single bad user bringing the entire system down. This task is to implement basic throttling and quotas for each client.
If RespondDecisionTaskCompleted sends in a bad request or corrupted data, we just silently ignore the ScheduleActivityTask decisions. Instead, we need to add a relevant failure event, like ActivityTaskScheduleFailed, and then also create a new DecisionTask for the decider. Here is an instance of the failure:
{"RunID":"c09c5b10-d240-4f8b-bc4c-5735c0bb3805","ScheduleID":212,"Service":"cadence-frontend","WorkflowID":"48018f57-0c39-4d4e-b055-e3df3fff7464","level":"error","msg":"RespondDecisionTaskCompleted. Error: BadRequestError({Message:Missing StartToCloseTimeoutSeconds in the activity scheduling parameters.})","time":"2017-03-07T13:56:56-08:00"}
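The fix could look something like the sketch below: validate the decision attributes, and on failure append a failure event plus a new decision task instead of dropping the decision. Event and field names are illustrative, not the exact Cadence schema.

```go
package main

import "fmt"

// Sketch of surfacing invalid ScheduleActivityTask decisions instead of
// silently dropping them: validate the attributes, append a failure
// event, and schedule a new decision task so the decider can react.
// Names are illustrative, not the exact Cadence schema.
type scheduleActivityAttributes struct {
	ActivityID                 string
	StartToCloseTimeoutSeconds int32
}

type historyEvent struct {
	eventType string
	cause     string
}

func handleScheduleActivityDecision(attr scheduleActivityAttributes) []historyEvent {
	if attr.StartToCloseTimeoutSeconds <= 0 {
		return []historyEvent{
			{"ActivityTaskScheduleFailed", "Missing StartToCloseTimeoutSeconds in the activity scheduling parameters."},
			{"DecisionTaskScheduled", ""}, // give the decider a chance to handle the failure
		}
	}
	return []historyEvent{{"ActivityTaskScheduled", ""}}
}

func main() {
	events := handleScheduleActivityDecision(scheduleActivityAttributes{ActivityID: "a1"})
	for _, e := range events {
		fmt.Println(e.eventType)
	}
	// ActivityTaskScheduleFailed
	// DecisionTaskScheduled
}
```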
This is very useful for supporting scenarios like storing config as a custom event when the workflow execution is started. This will allow users to make configuration changes without breaking running instances.
The History service seems to be creating a timeout task on each heartbeat. Instead, we should record the last heartbeat time in mutable state and only create a new timeout task when the existing one expires, based on the last recorded heartbeat time.
When the decider responds with a complete-workflow decision, we first update the execution with the new events and then delete the workflow execution in a separate transaction. This can cause issues when the update times out but is actually applied successfully; in that case we may never delete the workflow execution.
We need to make sure the execution is updated and deleted in the same transaction.
Certain workflows are easier to write if mutable state, instead of history, is exposed directly to the client for making decisions. Workflows like cron would prefer this model, and it is much more optimized for such scenarios. Also, using mutable state for things like activity retries is much preferable to having the client implement the retry logic.
We now have support for returning the correct host information when API calls to the history service fail with ShardOwnershipLostError.
The history client needs to inspect the ShardOwnershipLostError and retry the request against the host identified in the error.
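A sketch of the client-side redirect loop: on ShardOwnershipLostError, switch to the owner host carried in the error and retry, bounded by a redirect limit. The types here are simplified stand-ins for the real ones.

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of the client-side redirect: when a history call fails with
// ShardOwnershipLostError, re-resolve to the owner host carried in the
// error and retry. Types are simplified stand-ins for the real ones.
type ShardOwnershipLostError struct {
	Owner string // host that currently owns the shard
}

func (e *ShardOwnershipLostError) Error() string {
	return "shard ownership lost, owner: " + e.Owner
}

func call(host string) error {
	if host != "10.0.0.2:7934" { // only this host owns the shard in the sketch
		return &ShardOwnershipLostError{Owner: "10.0.0.2:7934"}
	}
	return nil
}

// callWithRedirect retries once per ownership-lost redirect, up to maxRedirects.
func callWithRedirect(host string, maxRedirects int) (string, error) {
	for i := 0; i <= maxRedirects; i++ {
		err := call(host)
		if err == nil {
			return host, nil
		}
		var solErr *ShardOwnershipLostError
		if !errors.As(err, &solErr) {
			return "", err // not retriable here
		}
		host = solErr.Owner // redirect to the reported owner
	}
	return "", errors.New("exceeded redirect limit")
}

func main() {
	host, err := callWithRedirect("10.0.0.1:7934", 2)
	fmt.Println(host, err) // 10.0.0.2:7934 <nil>
}
```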
Currently, Cadence supports dedupe on workflow ID only while the execution is still running. There are scenarios where workflows are fast-running and complete immediately, so it would be very useful to also support dedupe on workflow ID for completed executions.
The matching service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We see issues where requests that reach the matching engine before it is properly initialized cause panics.
We need to handle this the same way as the History service, where we block incoming requests on a wait group until service initialization is complete.
By design, the matching engine can lose tasks even before recording in the execution history that they started. This is OK for activity tasks, since they always have timeouts.
On the other hand, there is no ScheduleToStart timeout for decision tasks (to avoid unnecessary timeouts in case the decider is down or not polling for tasks). If a decision task is lost, the workflow execution will get stuck forever.
Right now, every request gets a WorkflowExecutionContext from the cache and then acquires a lock on that object. It is possible in edge conditions that two requests end up with two different context objects (request 1 gets the context, the context gets evicted from the cache, then request 2 creates a new object). This will break the guarantee that only one write per execution originates from the history engine at a time.
We can fix this by having a central lock manager that grants locks on executions instead of locking the context object itself.
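A minimal sketch of such a lock manager, keyed by execution, so two requests racing on the same execution always serialize on the same mutex even if the cached context was evicted in between. A real implementation would also reference-count entries so unused locks can be released; names here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of a central lock manager keyed by execution, so serialization
// no longer depends on the (evictable) cached context object.
type lockManager struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // keyed by workflowID+runID
}

func newLockManager() *lockManager {
	return &lockManager{locks: map[string]*sync.Mutex{}}
}

// LockExecution returns the one mutex for this execution, creating it
// on first use. Two requests for the same execution always get the
// same mutex, even if the cached context was evicted in between.
func (m *lockManager) LockExecution(key string) *sync.Mutex {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.locks[key]; !ok {
		m.locks[key] = &sync.Mutex{}
	}
	return m.locks[key]
}

func main() {
	mgr := newLockManager()
	a := mgr.LockExecution("wf-1/run-1")
	b := mgr.LockExecution("wf-1/run-1")
	fmt.Println(a == b) // true: both requests serialize on the same lock
	a.Lock()
	// ... load or rebuild the WorkflowExecutionContext, do the write ...
	a.Unlock()
}
```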
This feature is to support restarting workflows from a given point in the workflow execution history. Basically, you want to preserve the history of an execution up to a point and restart from that location. This is very useful when a workflow fails due to a bug at a certain point and you want to restart the workflow after fixing the bug.
Ideally, execution history should never get corrupted. If, for any reason (bugs?), we get into a state where this happens, we should not just return a retriable error to the callers.