spotify / styx Goto Github PK

"The path to execution", Styx is a service that schedules batch data processing jobs in Docker containers on Kubernetes.

License: Apache License 2.0

Shell 0.09% Java 99.85% Dockerfile 0.04% Python 0.02%

styx's People

Contributors

Stargazers

Watchers

Forkers

rugby110 fabriziodemaria danielnorberg kanterov bcleenders rwstephenson dillonhicks zatine djkingassassin andrewalker19972 www3838438 jromare curioustauseef tinaranic lfischerstrom anish749 mortent andrusha narape diffblue-benchmarks ahmedriza bengardiner tnsetting vchandvankar vikram-spotify wade1990 zhuohuwu0603 jpetterssonatspotify ckiosidis deacondesperado zeta1999 stjordanis devopstoday11 doytsujin ekmixon elisiac clairemcginty shreejitverma sckelemen bug-hunting-github isabella232 korallin aleksandr-spotify honnix max-pavoni benkonz sonjaer andresgomezfrr brandon-segal sriganth3

styx's Issues

Retry delay calculation should not count missing dependency runs

The current exponential backoff calculation that is applied when an execution fails uses the tries counter for the delay calculation. This means that a job that has been missing dependencies for N executions fails on execution N+1, it will get a high retry delay even though it's the first actual execution that did something. See TerminationHandler.

Fix the retry calculation to not consider missing dependency runs in the delay calculation.

No longer possible to point cli to a specific port

STYX_CLI_HOST=localhost:8080 styx t schedules.yaml test-workflow 2017-04-04
Exception in thread "main" java.lang.IllegalArgumentException: unexpected host: localhost:8080
	at com.squareup.okhttp.HttpUrl$Builder.host(HttpUrl.java:705)
	at com.spotify.styx.client.StyxApolloClient.getUrlBuilder(StyxApolloClient.java:354)
	at com.spotify.styx.client.StyxApolloClient.triggerWorkflowInstance(StyxApolloClient.java:164)
	at com.spotify.styx.cli.Main.triggerWorkflowInstance(Main.java:375)
	at com.spotify.styx.cli.Main.run(Main.java:157)
	at com.spotify.styx.cli.Main.main(Main.java:137)

@fabriziodemaria I think this might be caused by this change: https://github.com/spotify/styx/pull/138/files#diff-5b13ae6a3917caa3674163023f79466aR354

How about using e.g. guava HostAndPort.fromString() to parse the STYX_CLI_HOST env var?

No feedback when Workflow references non-existing secret

If the secret field of a Workflow Schedule references a secret volume that does not exist in k8s, Pods will fail to start but Styx does not detect the failure. The result is that the execution hangs in SUBMITTED for the TTL period and then times out. No useful feedback about the cause of the issue is surfaced anywhere.

missing next_natural_trigger value in response

curl http://<host>/api/v2/workflows/<cid>/<wfid>/state

next_natural_trigger in response is always null

DatastoreExceptions are not handled

The Storage and DatastoreStorage interfaces declared throws IOException but if there is a datastore error DatastoreException is thrown. The DatastoreException is a runtime exception and is not a IOException subclass. Thus all storage call sites that do catch (IOException e) will not actually handle datastore exceptions.

E.g. https://github.com/spotify/styx/blob/master/styx-scheduler-service/src/main/java/com/spotify/styx/StyxScheduler.java#L600

Improve performance of backfill show

The way Styx handles backfill show is not fast, and if there are many backfills, API call to backfills?showAll=true&status=true will kill more or less kill API service.

Not able to run tests locally

Hello,

Not sure if this counts as and issue but I'm trying to run mvn clean verify locally but continually fail to do it so help would be much appreciated.

The test I'm stuck on right now is this one com.spotify.styx.client.GoogleIdTokenAuthTest

[ERROR] testServiceAccountCredentialsWithAccessToken(com.spotify.styx.client.GoogleIdTokenAuthTest)  Time elapsed: 1.947 s  <<< ERROR!
com.google.cloud.storage.StorageException: [email protected] does not have storage.objects.get access to the Google Cloud Storage object.
	at com.spotify.styx.client.GoogleIdTokenAuthTest.testServiceAccountCredentialsWithAccessToken(GoogleIdTokenAuthTest.java:180)

[ERROR] testServiceAccountCredentialsWithAccessTokenFailsIfMissingEmailScope(com.spotify.styx.client.GoogleIdTokenAuthTest)  Time elapsed: 0.047 s  <<< ERROR!
com.google.cloud.storage.StorageException: [email protected] does not have storage.objects.get access to the Google Cloud Storage object.
	at com.spotify.styx.client.GoogleIdTokenAuthTest.testServiceAccountCredentialsWithAccessTokenFailsIfMissingEmailScope(G

[ERROR] testServiceAccountWithoutTokenCreatorRoleOnSelfFails(com.spotify.styx.client.GoogleIdTokenAuthTest)  Time elapsed: 594.514 s  <<< ERROR!
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata
Caused by: java.io.IOException: Error requesting access token
Caused by: com.google.api.client.http.HttpResponseException: 
403 Forbidden

[ERROR] testImpersonatedCredentials(com.spotify.styx.client.GoogleIdTokenAuthTest)  Time elapsed: 570.659 s  <<< ERROR!
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata
Caused by: java.io.IOException: Error requesting access token

[ERROR] testServiceAccountCredentials(com.spotify.styx.client.GoogleIdTokenAuthTest)  Time elapsed: 0.069 s  <<< ERROR!
com.google.cloud.storage.StorageException: [email protected] does not have storage.objects.get access to the Google Cloud Storage object.
	at com.spotify.styx.client.GoogleIdTokenAuthTest.testServiceAccountCredentials(GoogleIdTokenAuthTest.java:146)

I managed to get passed the com.spotify.styx.api.ServiceAccountsTest by creating the following Terraform script

locals {
  styx_project = "test-styx"
  styx_test_user = "styx-test-user"
  service_account_access = "storage.objectViewer"
  service_apis = ["iamcredentials.googleapis.com", "storage.googleapis.com", "storage-api.googleapis.com"]
}

provider "google" {
}

resource "random_id" "styx_id" {
  byte_length = 4
  prefix = local.styx_project
}

resource "google_project" "test_styx" {
  name = local.styx_project
  project_id = random_id.styx_id.hex
}

resource "google_service_account" "styx_test_user" {
  account_id   = local.styx_test_user
  project = google_project.test_styx.project_id
  display_name = "Styx Test Service Account"
}

resource "google_service_account_key" "styx_test_user_key" {
  service_account_id = google_service_account.styx_test_user.name
}

resource "google_project_iam_binding" "styx_project_iam_binding" {
  project = google_project.test_styx.project_id
  role = format("%s/%s", "roles", local.service_account_access)
  members = ["serviceAccount:${google_service_account.styx_test_user.email}"]
}

resource "google_project_service" "services" {
  for_each = toset( local.service_apis )
  project = google_project.test_styx.project_id
  service = each.key
}

output "project_id" {
  value = google_project.test_styx.project_id
}

output "styx_test_user" {
  value = google_service_account.styx_test_user.email
}

output "styx_test_user_key" {
  value = base64decode(google_service_account_key.styx_test_user_key.private_key)
  sensitive = true
}

Then changing ServiceAccountsTest#SERVICE_ACCOUNT to the following

  private static final String SERVICE_ACCOUNT = System.getProperty(
          "styxTestServiceAccount",
          "[email protected]");

export the credentials, project number to env and then run maven with the following line

mvn clean verify -DargLine="-DstyxTestServiceAccount=styx-test-user@test-styxbdXXXXXXXX.iam.gserviceaccount.com"

The reason I added the storage.objectViewer to the account is because of the error I get from GoogleIdTokenAuthTest however I just noticed that I probably get this error because I don't have access to the gs://styx-oss-test/styx-test-user.json which I guess contains the styx-test-user credentials, what does it contain?

Right now my plan is to create styx-test-unusable-user, remap the user in the test, make gs://styx-oss-test/styx-test-user.json Uri to constant that can be remapped to read a local file and I have no clue what to do with styx.foo.bar host so I have just mapped the domain it to my localhost expecting it to have something to do with the metadataServer but I have no clue if this is a good approach please advice.

Is there something more that I will encounter after the client tests that I need to consider?

It would be great if I can get some help setting this up, if there is no time you can close this or I'll it if I give up or succeed.

styx doc: permission denied

Hi,

I am on Nixpkgs (master) on macOS and executing styx doc after generating the example data, leads to

/Users/plumps/.nix-profile/bin/styx: line 252: /nix/store/prr875wd0966mm220ykgsqh15ymqldil-styx-0.6.0/share/doc/styx/index.html: Permission denied

Hitting contention limits on Datastore

We observed a few cases of:
com.google.cloud.datastore.DatastoreException: too much contention on these datastore entities. please try again.

This happened when storing workflows: com.spotify.styx.storage.AggregateStorage.storeWorkflow

This seems to happen when a component contains a high number of workflows and all the workflows in the component are updated in a loop (at a fast rate). This is probably due to the fact that workflows belonging to the same component are currently part of the same entity group, for which write rate limits apply: https://cloud.google.com/datastore/docs/concepts/limits.

NOTE: The retry mechanism currently handles these occurrences with minimum impact.

Support stopping a backfill gracefully

Feature request - a command to stop triggering new instances in a backfill, but let the currently executing instances complete.

Inconsistent patchState behaviour

If a user patches the state for a Workflow/Component calling the related API, the following can happen:

At first, the user patches the Docker image for the specific pair Component/Workflow;
Afterwards, the user patches the Docker image for the Component, expecting it to be picked up by all the Workflows within the Component;

The Docker image at Workflow level is always returned first when preparing executions, so the new configuration at Component level will not be picked up.

State transitions are not retried

When a state transition fails (possible due to an intermittent problem, like Datastore transaction exception), the transition is not retried even once. Styx rather waits for the state to become stale (exceed the configured ttl, normally on the order of minutes or hours) and then issues a timeout event.

This introduces unpredictability in Styx behavior, now that we expect transaction collisions somewhat often. The possible user experience is that the workflow instance isn't starting for several minutes. The same thing also manifests in flaky system tests that wait for Docker runs to happen.

add API blueprint for deprecated API versions

For v0 and v1, we will need another blueprint to keep track those.

enabled and next_natural_trigger values being null

When patching at component level, enabled and next_natural_trigger in response should not be null.

Halt backfill improvements

Halting a backfill causes all backfill's instances to be halted. Failing to halt even a single instance will stop the halting procedure giving to the user a message for half-success (backfill is halted, but some instances are still running). We should retry to halt an instance in case of error.

Weighted resource use, or plain concurrency limit

The current implementation of resource usage for a Workflow is through a plain resource reference in the WorkflowConfiguration definition. This has the problem that it only allows for a unit resource usage, per workflow instance.

In general, we need to align on what we mean with "resource" and "concurrency". We have to treat these two concepts separately:

A general resource and resource usage mechanism (that we have today), need a better association mechanism from workflows to resources. There's however some amount of detail that has to be considered here
- One workflow instance can use several resources (e.g. available memory or cpu quota in a GCP project), and these have different magnitudes and units (GB, count). The current model is backwards as it requires the resource limit to be set in a unit that is normalized to 1 unit used per workflow instance. Instead, a more natural mechanism would be for the workflow to define how many of each resource it will consume.
- A workflow instance usually submits one or more processing jobs to various processing runtimes (Dataproc, Dataflow, Hadoop, etc), these might or might not happen concurrently in the workflow, depending on wiring. So a fixed resource use for the workflow instance is at best a "max resource use" definition.
A concurrency limit on a workflow instance is a very simple limit that does not reflect any real resource usage. It leaves it completely up to the user to figure out how a workflow instance relates to some (unknown to Styx) resources, and from that derive a concurrency limit.

The second, more simplified concurrency limit can be reduced to use the more general resource/use mechanism. And we can do it as an internal detail that does not leak to the user.

I like that we are taking the first approach, but we need to be aware of the model we're dealing with and make it better reflected in the user touch points of Styx.

One immediate change that we need to do is to change the definition of the resource usage association in the workflow definition to be an object rather than a plain string:

schedules:
  - id: example-workflow
    partitioning: hours
    resources:
      - id: nodes
        use: 32

The use field can default to 1 which is the current behaviour, but the schema having an object there will allow us to evolve the definition.

Missing secret fails silently

Halting and re-triggering in short succession causes premature state termination

When an actively running Execution is halted (through the cli or api) Styx will delete the Pod in k8s. This process takes some time to complete as the container have to receive the SIGTERM signal and exit. It will get some time to exit gracefully (the so called grace period).

If the same Workflow Instance is triggered and a new Execution is started during this shutdown period, there will be two Pods in k8s which are associated with the same Workflow Instance (through a Pod annotation). When finally the first Pod exits (usually with status code 137, 128 + SIGTERM(9)), a terminate(137) event is generated for that Workflow Instance. This event is dispatched to the currently active state and we see a premature state transition.

At this point, the second execution Pod is orphaned.

A sequence of events outlining this scenario:

2016-11-28T14:39:57       triggerExecution          Trigger id: ad-hoc-cli-1480343996990-18517
2016-11-28T14:39:57       submit                    Execution description: ExecutionDescription{dockerImage=stagger:3, dockerArgs=[{}], secret=Optional.empty, commitSha=Optional.empty}
2016-11-28T14:39:57       submitted                 Execution id: styx-run-0ab59615-979b-4717-80f3-deab984d1074
2016-11-28T14:40:00       started
2016-11-28T14:41:17       halt
2016-11-28T14:41:22       triggerExecution          Trigger id: ad-hoc-cli-1480344082837-41704
2016-11-28T14:41:23       submit                    Execution description: ExecutionDescription{dockerImage=stagger:3, dockerArgs=[{}], secret=Optional.empty, commitSha=Optional.empty}
2016-11-28T14:41:23       submitted                 Execution id: styx-run-12994ebb-c7f7-4083-8844-952ecd63b3f8
2016-11-28T14:41:27       started
2016-11-28T14:41:48       terminate                 Exit code: 137
2016-11-28T14:41:48       retryAfter                Delay (seconds): 180

The terminate event at the end is actually coming from the first execution, styx-run-0ab59615-979b-4717-80f3-deab984d1074.

Fix log level according to #212

According to #212. We agreed that user error shall be logged as info; events, state machine transition, etc. shall be logged as debug.

Support CRON syntax for Workflow schedules

The current schedule partitioning definition only allows for a limited set of intervals (hourly, daily and weekly). This is of course very limiting and makes Styx's use very limited. We should add UNIX Cron syntax for the partitioning interval definition.

I've investigated how this could be implemented using the cron-utils library, and the results are promising. The way the library API allows for parsing a cron expressions and then calculating previous/next instants of execution fits the internal representations of Styx very well.

Here's an example of how the current interval definitions can be translated to equivalent cron expressions:

=== hourly ===
0 * * * *
description = every hour
lastExecution     = 2017-01-07T00:00Z
nextExecution     = 2017-01-07T01:00Z
nextNextExecution = 2017-01-07T02:00Z

=== daily ===
0 0 * * *
description = at 00:00
lastExecution     = 2017-01-07T00:00Z
nextExecution     = 2017-01-08T00:00Z
nextNextExecution = 2017-01-09T00:00Z

=== weekly ===
0 0 * * MON
description = at 00:00 at Monday day
lastExecution     = 2017-01-02T00:00Z
nextExecution     = 2017-01-09T00:00Z
nextNextExecution = 2017-01-16T00:00Z

=== monthly ===
0 0 1 * *
description = at 00:00 at 1 day
lastExecution     = 2017-01-01T00:00Z
nextExecution     = 2017-02-01T00:00Z
nextNextExecution = 2017-03-01T00:00Z

=== yearly ===
0 0 1 1 *
description = at 00:00 at 1 day at January month
lastExecution     = 2017-01-01T00:00Z
nextExecution     = 2018-01-01T00:00Z
nextNextExecution = 2019-01-01T00:00Z

=== custom ===
*/5 0,12 * * MON,WED,FRI
description = every 5 minutes at 0 and 12 hours at Monday, Wednesday and Friday days
lastExecution     = 2017-01-06T12:55Z
nextExecution     = 2017-01-09T00:00Z
nextNextExecution = 2017-01-09T00:05Z

We can of course keep support for the text definitions as they match the commonly used macro names.

Make API endpoint used in CLI configurable

The API endpoint url is right now a constant in the CLI code. It should be configurable to point to actual deployments of the Styx service.

https://github.com/spotify/styx/blob/master/styx-cli/src/main/java/com/spotify/styx/cli/Main.java#L61

Java 9 Illegal Access Warnings

java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

$ styx ls
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$2 (file:/usr/local/Cellar/styx-cli/1.0.75/libexec/styx-cli-1.0.75.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$2
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

I guess this is due to apollo using guice.

google/guice#1085

Initial workflow happens after "offset" interval which means a long wait time for larger offsets

During workflow initialisation, the next trigger instant is offset duration after the current time. This means that for larger values of offset (like 30 days, 90 days) the natural trigger would be after 30 days with the partition instant being the current time. While for smaller values (a few hours or 1 day offsets) this works well, but makes it harder to test and iterate fast for larger offsets. Larger offset values are typically used for tracking lagging metrics.

Changing the initial trigger to subtract the offset would solve the issue and would allow for a trigger to happen at the next schedule interval but with partition value which is offset by the given interval behind time. However, this would break the API as documented, since the workflow should trigger after the offset duration, according to the doc.

Proposal:
Add a separate flag in workflow which would allow only the initial trigger to happen at the next schedule interval. (At the next hour if hourly, at midnight, if daily, etc) with a partition value at 'offset' duration back. The following natural triggers can however happen every 'schedule'.

Smeared workflow execution

We would like to smear execution of scheduled workflows over time to achieve a smoother less bursty resource usage.

Would this imply smearing out the actual triggering (i.e. creation of active states) or would it make more sense to smear/throttle the actual execution of the workflows by e.g. some additional "smear-wait" state?

@rouzwawi WDYT?

Document injected environment variables

Add documentation describing the injected environment variables and what they mean.

https://github.com/spotify/styx/blob/master/styx-scheduler-service/src/main/java/com/spotify/styx/docker/KubernetesDockerRunner.java#L109

🐛 BUGFIX: Handle Flyte error codes that come from Dynamic Workflows

Description

While working with the deployment of the dynamic workflow, it was found that the workflow would result return an error code of RetriesExhaused|User:NotReady when there was a dependency missing instead of User:NotReady, which is the typical error code when a dependency was missing. Styx uses these error codes returned by Flyte to determine what status the Styx workflow instance should be, and if it is User:NotReady, the system will return a 20 error code for a missing dependency. (relevant code) With the Flyte team's help, the issue could be tracked to a set of locations in the Flyte propeller code.

Ideal Behavior

Styx returns a Missing Dependencies error code when the error code contains User:NotReady

Current Behavior

Styx returns an unknown error error code when the error code is not exactly User:NotReady

Possible Cause

Within the Flyte propeller code base, it was found that dynamic workflows will raise a RetryableFailure status if any dynamically generated nodes fail (relevant code). Once this status is raised for the dynamic workflow, the Flyte propeller will prepend the error code with RetriesExhaused| before the dynamic node's original error code (relevant code).

The impact is that any dynamic workflow cannot raise a User:NotReady in a way Styx can identify. This will result in erroneously labeling workflows as having unknown errors when the team may be raising error codes known to the Styx service but not recognized due to the RetriesExhaused string prepended to it.

Suggested Remediation

A possible remediation to this to allow dynamic workflows to raise specific Styx errors is to remove the RetriesExhaused| String before matching it to any of the known error codes.

Support configurable retry count

Use case: a very costly and long running job, where owners prefer an alert upon first failure, instead of retrying automatically.

One command to retrigger regardless of state

It seems we have forced annoying guesswork onto users, by providing styx trigger (which works if the workflow instance is inactive) and styx retry (which works if the workflow instance is active). Users tend to try both and see whichever works - or they would need to be aware of the workflow instance state to know which variant will work.

I suggest that retry should go away or just be an alias of trigger, they should work in all cases, except when the workflow is disabled or the instance is already running, and in those cases they should provide feedback and perhaps an option to use force...

Changing Workflow partitioning causes trigger time offsets

When changing a Workflows partitioning configuration, the previously stored nextNaturalTrigger time is kept, which potentially leads to an offset trigger time.

Reproducing

found an hourly Workflow that was enabled
nextNaturalTrigger: 2016-12-08 (19:00:00.000) UTC
changed to daily
it was triggered at 2016-12-08 (19:00:00.000) UTC
nextNaturalTrigger: 2016-12-09 (19:00:00.000) UTC

The workflow is now offset to 19:00 UTC instead of 00:00 which a daily workflow would trigger at normally.

Need more reliable halt

The Styx CLI halt command works via the common QueuedStateManager, and slow processing of the queue is one of the more common overload symptoms in Styx. Thus when Styx is overloaded with elevated queued events count, styx halt doesn't appear to do anything.

The CLI error is moreover quite uninformative, like API error: 500 : "Request Request{method=POST, url=https://styx-scheduler.spotify.net/api/v0/halt, tag=Request{method=POST, url=https://styx-scheduler.spotify.net/api/v0/halt, tag=null}} failed"

Silent failure on triggering non-existent date

The date 2017-04-31 does not exist, but:

$ styx t dano-test-pipeline dano-test 2017-04-31
Triggered! Use `styx ls -c dano-test-pipeline` to check active workflow instances.

$ e dano-test-pipeline dano-test 2017-04-31
TIME                      EVENT                     DATA