shopify / ci-queue Goto Github PK

View Code? Open in Web Editor NEW

166.0 166.0 26.0 766 KB

Distribute tests over many workers using a queue

License: MIT License

Ruby 85.48% Shell 0.05% Lua 1.26% Makefile 0.42% Python 12.79%

ci-queue's People

Contributors

Stargazers

Watchers

ci-queue's Issues

Allow to stop popping test if there already X failures

In some extreme cases, a single low level issue can make a huge portion of the test suite fail.

It can make sense in this context to stop running tests once X errors have been recorded, both to save CPU time and to keep a readable output.

cc @Shopify/pipeline

Add CHANGELOG

When Dependabot raises new PR's for ci-queue updates there is no information contained in the PR about what changed which forces a developer to dive into this project to understand what has happened.

Once inside this project, there is no CHANGELOG or Github releases which forces developers to manually inspect PR's or commits to understand what has happened.

This is a poor use of developer's time.

Adding a CHANGELOG or adding information to Github releases can streamline this process and give more confidence to consumers when updates occur.

Skipped tests showing up as errors in summary if another test failed

It looks like that skipped tests are showing up as errors in the list if there is another error happening.

[REDACTED]

Sometimes one worker goes rogue and doesn't stop at the end, making my build fail

I'm running ci-queue in a bit of an alternative setting. I've got a Ruby on Rails app which has a long test suite, and I'm running 8 concurrent instances of RSpec on Heroku CI, but on one and the same dyno, which also runs Redis in-dyno.

I start the whole chain with the following script

export CI_QUEUE_URL=$([ -z "$REDIS_URL" ] && echo "redis://127.0.0.1:6379" || echo "$REDIS_URL")

for num in $(seq 1 $PARALLEL_COUNT); do
  RUBYOPT="-W0" \
  CI_NODE_INDEX=$(expr $num - 1) \
  DATABASE_URL=$([ "$num" -ne "1" ] && echo $DATABASE_URL$num || echo $DATABASE_URL) \
  rspec-queue \
    --timeout 180 \
    --max-consecutive-failures 10 \
    --max-requeues 5 \
    --requeue-tolerance 5 $@ &
done

wait
rspec-queue --report

Every now and then when this runs on Heroku CI, 7 out of the 8 workers end at the same time, but it seems the 8th one keeps on running for hours until Heroku kills my build after 2 hours that usually passes in about 10-15 minutes. It's as if it doesn't understand that the queue is done.

Randomized with seed 57482

.

Finished in 9 minutes 51 seconds (files took 10.22 seconds to load)

85 examples, 0 failures, 1 pending

Randomized with seed 15517

< THIS IS WHERE I EXPECT THE REPORT >

.......................................................................................................................-----> test command `bin/test` failed with signal: terminated

While there are 8 instances running, the word Finished only shows up 7 times in the log, so I assume what I see here is the finishing of workers 6 and 7, and the following dots being part of worker 8.

I have no idea how to debug this further, but I'd love to add more details if someone can help me in the right direction.

A little more research

I've compared the output of a successful build and a failed build.

A successful build had 8 workers finish with a total of 1216 examples, an average number of 152.
A failed build had 7 workers finish with a total of 1194 examples reported.
- This means that at the time they finished, the 8th worker only processed 22 examples, which is remarkebly few compared to the average of 170 examples processed by the other 7. (This could potentially corroborate your theory). However:
- After that, the 8th worker drew 130 dots, which as I can only imagine refer to specs that already passed in different workers.

Error thrown when no worker-id is present when redis > 5.0.0

Since we updated redis to 5.0.5, we are facing errors whenever worker-id is not present:

$ bundle exec minitest-queue run -Itest $(find test -name \*_test.rb | sort)

bundler: failed to load command: minitest-queue (/tmp/bundle/ruby/3.1.0/bin/minitest-queue)
W, [2022-10-04T21:09:42.446048 #241]  WARN -- : calling Tracer#shutdown multiple times.
/tmp/bundle/ruby/3.1.0/gems/redis-client-0.9.0/lib/redis_client/command_builder.rb:37:in `block in generate': Unsupported command argument type: NilClass (TypeError)
	from /tmp/bundle/ruby/3.1.0/gems/redis-client-0.9.0/lib/redis_client/command_builder.rb:28:in `map!'
	from /tmp/bundle/ruby/3.1.0/gems/redis-client-0.9.0/lib/redis_client/command_builder.rb:28:in `generate'
	from /tmp/bundle/ruby/3.1.0/gems/redis-client-0.9.0/lib/redis_client.rb:211:in `call_v'
	from /tmp/bundle/ruby/3.1.0/gems/redis-5.0.5/lib/redis/client.rb:73:in `call_v'
	from /tmp/bundle/ruby/3.1.0/gems/redis-5.0.5/lib/redis.rb:167:in `block in send_command'
	from /tmp/bundle/ruby/3.1.0/gems/redis-5.0.5/lib/redis.rb:166:in `synchronize'
	from /tmp/bundle/ruby/3.1.0/gems/redis-5.0.5/lib/redis.rb:166:in `send_command'
	from /tmp/bundle/ruby/3.1.0/gems/redis-5.0.5/lib/redis/commands/sets.rb:21:in `sadd'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/ci/queue/redis/worker.rb:218:in `register'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/ci/queue/redis/worker.rb:211:in `push'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/ci/queue/redis/worker.rb:27:in `populate'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/minitest/queue/runner.rb:282:in `populate_queue'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/minitest/queue/runner.rb:87:in `run_command'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/minitest/queue/runner.rb:38:in `public_send'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/minitest/queue/runner.rb:38:in `run!'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/lib/minitest/queue/runner.rb:19:in `invoke'
	from /tmp/bundle/ruby/3.1.0/gems/ci-queue-0.24.2/exe/minitest-queue:5:in `<top (required)>'
	from /tmp/bundle/ruby/3.1.0/bin/minitest-queue:25:in `load'
	from /tmp/bundle/ruby/3.1.0/bin/minitest-queue:25:in `<top (required)>'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli/exec.rb:58:in `load'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli/exec.rb:58:in `kernel_load'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli/exec.rb:23:in `run'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli.rb:483:in `exec'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli.rb:31:in `dispatch'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/cli.rb:25:in `start'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/exe/bundle:48:in `block in <top (required)>'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/lib/bundler/friendly_errors.rb:103:in `with_friendly_errors'
	from /usr/local/ruby/lib/ruby/gems/3.1.0/gems/bundler-2.3.9/exe/bundle:36:in `<top (required)>'
	from /usr/local/ruby/bin/bundle:25:in `load'
	from /usr/local/ruby/bin/bundle:25:in `<main>'

This problem only appears when worker_id is nil, since we send it down to redis which is now doing more strict type checking.

I think in the ideal scenario, worker_id should never be nil, but in that case, the command should fail in argument validation whenever no worker ID was provided or inferred, instead of failing when trying to access redis

ci-queue: Classification Check

Classification Check

TL;DR

1st party software must have a business impact classification. See our list of classification to help you determine the classification of your application.

Why is this being asked?

It's critical that we understand the business impact of the software we write. The classification of your service determines the level of support you must provide as owners and helps us set standards.

What will happen if it doesn't get done within the expected timeframe?

This is mandatory. If this is not done you won't be able to have any runtimes.

When does it need to get done?

At the latest, this should be done before 2021-07-22.

This doesn't apply to my service. What do I do?

First, leave a comment explaining why it doesn't apply. Then, leave another comment as /not_applicable, and close the issue.
If you change your mind, make sure to comment any reason and reopen the issue.

I have questions/concerns about this

Please contact the Production Excellence team using Slack at #production-excellence-team.
Your service: ci-queue/rubygems
Owners:

pytest: check if tests are already pushed before gathering all the tests

Currently, all worker nodes

calculate the list of tests to push, then
check to see if that list has already been pushed to master, and if not,
try to push it.

We can shave off some test time if a worker first

checked to see if the list of tests has already been pushed. And if not,
calculate the list of tests to push, ...

Allow to reproduce order-related failures by printing a full executable command

When an order-related failure happens is very difficult to reproduce it locally for large suites (which are likely among users of this library).

Having a way to reproduce the exact sequence locally would be a great help and would considerably reduce --bisect times.

If there's consensus I can try to propose an implementation for this 🙋‍♂️

0.28.0 version is not compatible with redis 4.8.1+

This PR introduced the usage of a method sadd that always returns an integer value.

See failed CI run https://buildkite.com/shopify/shop-server/builds/167914#01886d13-625e-4fbe-a514-5c289ee654dd

Implement retrying a test in the pytest integration

Currently the pytest implementation of ci does not take advantage of the library's ability to retry a test. The suggested pseudocode implementation can be found in the readme, but concretely, it will likely involve adding functionality to these two methods

`minitest-queue` `ENOENT` error if `log` folder does not exist

👋 Running into a ENOENT error when the gem tries to open up the log/test_order.log file and the log directory does not exist.

Running this command on CI:

bundle exec minitest-queue run --max-requeues 3 --requeue-tolerance 0.01 --timeout 45 -Itest:lib $(find test -name \*_test.rb | sort)

Causes this error:

/usr/local/bundle/gems/bundler-1.17.3/lib/bundler/rubygems_integration.rb:200: warning: constant Gem::ConfigMap is deprecated
--
  | Traceback (most recent call last):
  | 8: from /tmp/bundle/ruby/2.7.0/gems/minitest-5.14.1/lib/minitest.rb:68:in `block in autorun'
  | 7: from /tmp/bundle/ruby/2.7.0/gems/minitest-5.14.1/lib/minitest.rb:139:in `run'
  | 6: from /tmp/bundle/ruby/2.7.0/gems/minitest-5.14.1/lib/minitest.rb:849:in `start'
  | 5: from /tmp/bundle/ruby/2.7.0/gems/minitest-5.14.1/lib/minitest.rb:849:in `each'
  | 4: from /tmp/bundle/ruby/2.7.0/gems/minitest-reporters-1.4.2/lib/minitest/minitest_reporter_plugin.rb:16:in `start'
  | 3: from /tmp/bundle/ruby/2.7.0/gems/minitest-reporters-1.4.2/lib/minitest/minitest_reporter_plugin.rb:16:in `each'
  | 2: from /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/order_reporter.rb:11:in `start'
  | 1: from /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/order_reporter.rb:11:in `open'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/order_reporter.rb:11:in `initialize': No such file or directory @ rb_sysopen - log/test_order.log (Errno::ENOENT)

Committing an empty log folder at the root of my repo fixes it. The folder should just be created on its own if it doesn't exist and is required at runtime.

Redis 4.6 is causing CI warnings of deprecation

eg: #181

then I get on buildkite
https://buildkite.com/shopify/bloodhound/builds/10574#69a9c3dc-3685-4b17-b0b2-c6a3b960703e

Pipelining commands on a Redis instance is deprecated and will be removed in Redis 5.0.0. 61% Time: 00:02:00
--
  |  
  | redis.multi do
  | redis.get("key")
  | end
  |  
  | should be replaced by
  |  
  | redis.multi do \|pipeline\|
  | pipeline.get("key")
  | end
  |  
  | (called from /tmp/bundle/ruby/3.0.0/gems/ci-queue-0.21.1/lib/ci/queue/redis/base.rb:24:in `size'}
  | Pipelining commands on a Redis instance is deprecated and will be removed in Redis 5.0.0. 61% Time: 00:02:00
  |  
  | redis.multi do
  | redis.get("key")
  | end
  |  
  | should be replaced by
  |  
  | redis.multi do \|pipeline\|
  | pipeline.get("key")
  | end
  |  
  | (called from /tmp/bundle/ruby/3.0.0/gems/ci-queue-0.21.1/lib/ci/queue/redis/base.rb:24:in `size'}
  | Pipelining commands on a Redis instance is deprecated and will be removed in Redis 5.0.0.
  |  
  | redis.multi do
  | redis.get("key")
  | end
  |  
  | should be replaced by
  |  
  | redis.multi do \|pipeline\|
  | pipeline.get("key")
  | end

Have a way to disable requeues

This would make it easier to test for flakiness. Idea: add [ci no-requeue] to commit message but I am open to other ideas.

cc @gmalette

Automatically detect retried jobs

Currently some of our projects use the following wrapper script:

command = %w(bundle exec minitest-queue)
command << (ENV['BUILDKITE_RETRY_COUNT'].to_i > 0 ? 'retry' : 'run')

This is pretty much always the behavior you want, so minitest-queue and rspec-queue should behave like this by default if they detect the job was retried.

I need to check CircleCI & Travis-ci docs to know if they have a similar env var.

cc @jmreid @wvanbergen

Document usage

Or remove the TODO from the README if it plug-and-play.

Starscream shows successfully retried tests as errors

On otherwise good builds, retried tests show up as failures in Starscream.

Despite having been retried successfully.

Node 2:

tests/starscream/loaders/test_sqlalchemy_loaders.py::TestRedshiftLoaderIntegration::test_load_from_scratch_incremental[parquet] FAILED WILL_RETRY 
tests/starscream/loaders/test_sqlalchemy_loaders.py::TestRedshiftLoaderIntegration::test_load_from_scratch_incremental[parquet] SKIPPED

Node 3:

tests/starscream/loaders/test_sqlalchemy_loaders.py::TestRedshiftLoaderIntegration::test_load_from_scratch_incremental[json] PASSED

cc/ @Shopify/dev-acceleration

CI::Queue::Static and CI::Queue::File initialization is broken

After #34, you need to initialize the File and Static queue passing the file name or the test respectively and you also need to populate the queue with the list of tests. This seems contra-intuitive.

Maybe the initialize should only take the config and the populate the list of tests and add a .populate class method (or any other name) that does what the initialize used to do.

Also those queues require a configuration object that is not needed in most of its use cases. Making it optional, or adding a high level method as .populate that creates the configuration object would be great.

cc @casperisfine @wvanbergen @tjoyal

Nitpick:

The @index instance variable is only set after you call #populate so if you don't you get a "non-method error fetch for nil" in the to_a method and a ruby warning because the instance variable is not initialized. Maybe we should try to write warning free code so we could avoid this kind of error?

Sensible Defaults for Python Redis Client

By default no timeouts are set for the Redis client. It's poorly documented, but as far as I can tell, the Redis client will wait hang infinitely by default. This is not a sensible default for ci-queue clients, forcing every client to specify this value. socket_connect_timeout defaults to whatever is specified for socket_timeout, but since that is handled deep down in the Redis client and certainly not documented by ci-queue a responsible client might not rely on that.

I suggest explicitly setting sensible defaults for these two values to reduce the burden on clients.

ci-queue/python/ciqueue/_pytest/test_queue.py

Lines 59 to 81 in 3b7ac59

    
           def build_queue(queue_url, tests_index=None): 
        
               spec = uritools.urisplit(queue_url) 
        
               if spec.scheme == 'list': 
        
                   return ciqueue.Static(spec.path.split(':')) 
        
               elif spec.scheme == 'file': 
        
                   return ciqueue.File(spec.path) 
        
               elif spec.scheme == 'redis': 
        
                   redis_args = parse_redis_args(spec) 
        
                   redis_client = redis.StrictRedis(**redis_args) 
        
                   worker_args = parse_worker_args(spec.query, tests_index) 
        
                   retry = bool(worker_args['retry']) 
        
                   del worker_args['retry'] 
        
                   klass = ciqueue.distributed.Worker 
        
                   if tests_index is None: 
        
                       klass = ciqueue.distributed.Supervisor 
        
                   queue = klass(tests=tests_index, redis=redis_client, **worker_args) 
        
                   if retry and tests_index: 
        
                       queue = queue.retry_queue() 
        
                   return queue 
        
               else: 
        
                   raise "Unknown queue scheme: " + repr(spec.scheme)

Minitest workers won't print errors by default

The way we insert our custom error reports make it so that Minitest won't print anything anymore.

We should stay as close as possible to minitest default behavior if no other reporters are configured.

RSpec: failures are not acknowledged if a before hook fails

CI::Queue::Redis::ReservationError: "./spec/models/redacted_spec.rb[1:20]" is already reserved. You have to acknowledge it before you can reserve another one
--
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/lib/ci/queue/redis/worker.rb:117:in `reserve'
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/lib/ci/queue/redis/worker.rb:48:in `poll'
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/lib/rspec/queue.rb:347:in `block (2 levels) in run_specs'
  | /app/vendor/bundle/ruby/2.4.0/gems/rspec-core-3.6.0/lib/rspec/core/configuration.rb:1894:in `with_suite_hooks'
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/lib/rspec/queue.rb:346:in `block in run_specs'
  | /app/vendor/bundle/ruby/2.4.0/gems/rspec-core-3.6.0/lib/rspec/core/reporter.rb:79:in `report'
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/lib/rspec/queue.rb:343:in `run_specs'
  | /app/vendor/bundle/ruby/2.4.0/gems/rspec-core-3.6.0/lib/rspec/core/runner.rb:87:in `run'
  | /app/vendor/bundle/ruby/2.4.0/gems/rspec-core-3.6.0/lib/rspec/core/runner.rb:71:in `run'
  | /app/vendor/bundle/ruby/2.4.0/gems/rspec-core-3.6.0/lib/rspec/core/runner.rb:45:in `invoke'
  | /app/vendor/bundle/ruby/2.4.0/gems/ci-queue-0.11.1/exe/rspec-queue:4:in `<top (required)>'
  | /app/vendor/bundle/ruby/2.4.0/bin/rspec-queue:23:in `load'
  | /app/vendor/bundle/ruby/2.4.0/bin/rspec-queue:23:in `<top (required)>'

support for python 3.6

the python implementation of ciqueue should be made to work with python 3.6, as well as the current python 2.7. We should extend the test environment to cover both versions of python (tox may come in handy for this).

don't push tests if there are failures during collection

eg.
https://buildkite.com/shopify/starscream/builds/80219#9bbff676-e0e4-4ad2-b121-45c8584f82de
In this instance, we know that there are errors during collection, but we push them anyway. The tests can't run on the workers, but by pushing them to the queue, an reporter is fooled into thinking that they'll eventually complete.

More aggressively retry Redis commands

Acceptance Criteria

We need to go over all the commands we emit and make sure they are idempotent, otherwise retrying could result in a corrupted state, lost tests etc.
The redis gem has the necessary elements for that it's mostly just configuration.

Context

Sometimes our Redis server that handle the ci-queue workload experience a failover or some other availability issues.

When this happens it break builds even though it recovers pretty fast.

Examples

Error connecting to Redis on redacted.svc.cluster.local.:6379 (SocketError) (Redis::CannotConnectError)

./tmp/bundle/ruby/3.1.0/gems/redis-4.8.0/lib/redis/client.rb:162:in `call': MASTERDOWN Link with MASTER is down and replica-serve-stale-data is set to 'no'. (Redis::CommandError)

(that later one need to be better categorized by the redis gem though)

Solution

Ideally we'd be resilient to these small transient errors, this means retrying all or most commands and possibly waiting a bit before retrying. The redis gem has the necessary elements for that it's mostly just configuration.

However we need to go over all the commands we emit and make sure they are idempotent, otherwise retrying could result in a corrupted state, lost tests etc.

cc @ChrisBr

Separate populating the queue from working off the queue

I think I'd like to split up populating the queue, and working off the queue, so we can put a filter/order step in between. This allows us to remove tests we don't want to run, or order the list so the slowest ones get run first.

Retrieve list of tests. [Implementations: Minitest discovery, File, ...]
Annotate tests [e.g. Get statistics about every test case]
Filter/order based on annotations.
Populate to worker queue [Implementations: Memory, Redis]

Then, the existing worker code can work off this queue.

For this to work, we probably need to wrap every test case in a class, rather than representing it as a string, e.g. CI::Queue::Entry. This class would hold all annotation.
Annotations can come from the input list. This could be a more elaborate file format. For the minitest discovery, we could annotate test methods in the Ruby code.
Annotations can also come from external sources as a separate step. E.g. retrieve the duration data for all tests in the list from an external database.

@DazWorrall @casperisfine what do you think?

Timed out tests shouldn't be requeued

In the following scenario:

Worker A pop a test and get stuck until timeout is reached
Worker B steal it from worker A and work it off successfully.
Worker A finally complete the test with a failure.

Expected behavior:

Worker A should see that the test was already acknowledged by another worker and simply drop the result.

Current behavior:

The test is requeued by worker A

Discovered by @lugray

Split the gem in 3

Because both minitest and rspec integration are provided by the same gem, we can't define explicit dependencies in the gemspec, even though that would make testing much much easier.

I think we should split in 3 gems:

ci-queue: Keep all the queue implementations, depends on pretty much nothing.
minitest-queue: depends on ci-queue, a specific minitest and minitest-reporters.
rspec-queue: depends on ci-queue and a specific rspec version.

Re-add Python lint and 3.8 support

Errors in the python-tests job in the Tests workflow were preventing new versions of the Ruby gem from being released. To unblock the release, we stopped running the tests against Python 3.8 and removed the lint check with #170. Support for Python 3.8 and the lint check should be re-added.

Hard crash on test failures with UTF-8 characters in test names

I believe this gist should help reproduce this error: https://gist.github.com/mrsimo/9cfc750719a7230c2a96803a4c861042

I added a puts inside the redis gem just before executes a command, hence the redis output we see:

$ be minitest-queue retry --build wadus-1 --queue $queue test.rb
["*2", "$4", "auth", "$32", <zip>"]
["*4", "$6", "lrange", "$27", "build:wadus-1:worker::queue", "$1", "0", "$2", "-1"]
["*4", "$4", "hset", "$27", "build:wadus-1:error-reports", "$57", "BugTest#test_doesn’t_allow_requests_from_inactive_users", "$333", "\x04\b{\b:\x19test_and_module_nameI\">BugTest#test_doesn\xE2\x80\x99t_allow_requests_from_inactive_users\x06:\x06ET:\x0Etest_nameI\"6test_doesn\xE2\x80\x99t_allow_requests_from_inactive_users\x06;\x06T:\voutputI\"\x01\x9F\e[31mFAIL\e[0m BugTest#test_doesn\xE2\x80\x99t_allow_requests_from_inactive_users\n\e[33mExpected true to not be truthy.\e[0m\n    test.rb:10:in `block in <class:BugTest>'\n\n\x06;\x06T"]
Traceback (most recent call last):
	35: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-5.11.3/lib/minitest.rb:63:in `block in autorun'
	34: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-5.11.3/lib/minitest.rb:136:in `run'
	33: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/minitest/queue.rb:145:in `__run'
	32: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/minitest/queue.rb:156:in `run_from_queue'
	31: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/ci/queue/static.rb:52:in `poll'
	30: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/minitest/queue.rb:177:in `block in run_from_queue'
	29: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-5.11.3/lib/minitest.rb:802:in `record'
	28: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-5.11.3/lib/minitest.rb:802:in `each'
	27: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-5.11.3/lib/minitest.rb:803:in `block in record'
	26: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-reporters-1.3.0/lib/minitest/minitest_reporter_plugin.rb:20:in `record'
	25: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-reporters-1.3.0/lib/minitest/minitest_reporter_plugin.rb:20:in `each'
	24: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/minitest-reporters-1.3.0/lib/minitest/minitest_reporter_plugin.rb:21:in `block in record'
	23: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/minitest/queue/build_status_recorder.rb:52:in `record'
	22: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/ci-queue-0.13.4/lib/ci/queue/redis/build_record.rb:20:in `record_error'
	21: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis.rb:2303:in `pipelined'
	20: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `synchronize'
	19: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'
	18: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis.rb:45:in `block in synchronize'
	17: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis.rb:2307:in `block in pipelined'
	16: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:153:in `call_pipeline'
	15: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:291:in `with_reconnect'
	14: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:155:in `block in call_pipeline'
	13: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:185:in `call_pipelined'
	12: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:218:in `process'
	11: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:304:in `logging'
	10: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:219:in `block in process'
	 9: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:366:in `ensure_connected'
	 8: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:220:in `block (2 levels) in process'
	 7: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:220:in `each'
	 6: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:226:in `block (3 levels) in process'
	 5: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:267:in `write'
	 4: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:248:in `io'
	 3: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/client.rb:269:in `block in write'
	 2: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/connection/ruby.rb:356:in `write'
	 1: from /home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/connection/command_helper.rb:30:in `build_command'
/home/albert/.rbenv/versions/2.5.1/lib/ruby/gems/2.5.0/gems/redis-4.0.1/lib/redis/connection/command_helper.rb:30:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

If I change the ’ from doesn’t things go well.

I can hopefully just change all these in my test suite, but it seems we have very enthusiastic developers when comes to proper apostrophes in our team 😆 and we'll just be one ’ away from test catastrophe.

Ruby ci-queue requires gems that are not runtime dependencies

We are getting LoadError: cannot load such file errors from the ci-queue ruby gem when trying to use it. It looks to me that the gem specifies several development_dependencies but uses them outside of development at runtime.

The fix on our end is to require this gems dependencies in our gemfile. The gem should properly specify its dependencies so installing it via bundler will grab everything it needs.

Here are the errors I've encountered:

bundler: failed to load command: minitest-queue (/tmp/bundle/ruby/2.7.0/bin/minitest-queue)
--
  | LoadError: cannot load such file -- redis
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/ci/queue/redis.rb:3:in `require'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/ci/queue/redis.rb:3:in `<top (required)>'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/ci/queue.rb:47:in `require'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/ci/queue.rb:47:in `from_uri'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/runner.rb:31:in `run!'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/runner.rb:19:in `invoke'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/exe/minitest-queue:5:in `<top (required)>'
  | /tmp/bundle/ruby/2.7.0/bin/minitest-queue:23:in `load'
  | /tmp/bundle/ruby/2.7.0/bin/minitest-queue:23:in `<top (required)>'

bundler: failed to load command: minitest-queue (/tmp/bundle/ruby/2.7.0/bin/minitest-queue)
--
  | LoadError: cannot load such file -- minitest/reporters
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue.rb:4:in `require'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue.rb:4:in `<top (required)>'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/runner.rb:4:in `require'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/lib/minitest/queue/runner.rb:4:in `<top (required)>'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/exe/minitest-queue:4:in `require'
  | /tmp/bundle/ruby/2.7.0/gems/ci-queue-0.20.5/exe/minitest-queue:4:in `<top (required)>'
  | /tmp/bundle/ruby/2.7.0/bin/minitest-queue:23:in `load'
  | /tmp/bundle/ruby/2.7.0/bin/minitest-queue:23:in `<top (required)>'

Failure type in JUnit XML makes no sense

I think this is causing the issue where skipped tests are reported in the buildkite annotation, instead of any actual failures:

<testsuite name="ShopSearchIndexingJobTest" filepath="components/platform/test/unit/jobs/shop_search_indexing_job_test.rb" skipped="1" failures="0" errors="0" tests="1" assertions="0" time="0.06969359999993685">
  <testcase name="test_update_shop_documents_when_shop's_active_field_changes" lineno="30" classname="ShopSearchIndexingJobTest" assertions="0" time="0.06969359999993685" flaky_test="true">
    <failure type="test_update_shop_documents_when_shop's_active_field_changes" message="This test is very slow: [url]">
Skipped: test_update_shop_documents_when_shop's_active_field_changes(ShopSearchIndexingJobTest) [/app/components/platform/test/unit/jobs/shop_search_indexing_job_test.rb:31]: This test is very slow: [url]
    </failure>
  </testcase>
</testsuite>

This was a minitest skipped test (as in: skip at the top of the test definition, not testability magic). 2 things:

The failure type makes no sense (it's the test name)
Possibly because of the above, you can't tell from the XML that is was skipped. It looks like a regular failure, so our parsing logic can't ignore these

cc @Shopify/pipeline

Q: Is ordering of tests in the queue based on previous timing statistics?

Hello!

We're in the process of evaluating ci-queue for our RSpec suite and I'd like to know: is the timing of a previous example run taken into account when scheduling the example in future builds? E.g. "schedule slower examples first".

Thanks in advance!

Test summary does consider results from retries

As discussed on Slack: https://shopify.slack.com/archives/C9BPSMUDC/p1521483816000555

When a failing (flaky) test turns green after a retry, the test results always keep failing on retry. The failing tests returned are the one before the retry, so the only way to get a passing build is to retry the entire suite 🔥💰

rspec-queue does not fail if there are NameErrors in specs

I've just encountered the following behaviour with the rspec runner of ci-queue:

Modify a spec to reference an uninitialized constant so as to trigger a NameError, e.g. by referring to a non-existent class:

describe MyNonExistentClass do
 ...
end

Run rspec-queue
rspec-queue exits successfully.

I guessed wrongly that rspec-queue would fail with a non-zero exit code, but it exited with 0.

Bare rspec exits with 1 in the above situation.

I'm wondering if this is intended behaviour - perhaps its needed for retries to work?

The error message output is:

NameError: uninitialized constant MyNonExistentClass

Debugging shows that line 400 is where the 0 exit code is returned, because the NameError causes @world.non_example_failure to be true. Syntax errors appear to have the same result:

ci-queue/ruby/lib/rspec/queue.rb

Lines 387 to 401 in cba2cdd

    
           success = true 
        
           @configuration.reporter.report(examples_count) do |reporter| 
        
             @configuration.add_formatter(BuildStatusRecorder) 
        
             @configuration.with_suite_hooks do 
        
               break if @world.wants_to_quit 
        
               queue.poll do |example| 
        
                 success &= example.run(QueueReporter.new(reporter, queue, example)) 
        
                 break if @world.wants_to_quit 
        
               end 
        
             end 
        
           end 
        
           return 0 if @world.non_example_failure 
        
           success ? 0 : @configuration.failure_exit_code

pytest: TestCase failures reported as passing

With this test suite:

# test_example.py
from unittest import TestCase

class TestExample(TestCase):
    def test_fail(self):
        assert False

Running works as expected:

$ pytest -p ciqueue.pytest --queue 'redis://localhost:6379/0?worker=0&build=repro&retry=0&timeout=10' test_example.py
========================================================= test session starts =========================================================
platform darwin -- Python 2.7.13, pytest-4.0.1, py-1.7.0, pluggy-0.6.0
...
collected 1 item

test_example.py F

============================================================== FAILURES ===============================================================
________________________________________________________ TestExample.test_fail ________________________________________________________

self = <test_example.TestExample testMethod=test_fail>

    def test_fail(self):
>       assert False
E       AssertionError: assert False

test_example.py:5: AssertionError
====================================================== 1 failed in 0.06 seconds =======================================================

However, reporting shows that the test passed:

$ pytest -p ciqueue.pytest_report --queue 'redis://localhost:6379/0?worker=0&build=repro&retry=0&timeout=10' test_example.py
========================================================= test session starts =========================================================
platform darwin -- Python 2.7.13, pytest-4.0.1, py-1.7.0, pluggy-0.6.0
...
collected 1 item

test_example.py .                                                                                                               [100%]

====================================================== 1 passed in 0.02 seconds =======================================================

This appears to be due to ciqueue's makereport triggering before the one in pytest which is responsible for setting call.excinfo for TestCase-style tests.

From these docs it looks like:

the ordering of multiple tryfirst=True hooks is undefined
hookwrapper=True can execute code before a tryfirst hook

So, one fix could be to get pytest to switch to using hookwrapper. I figured I'd raise it with you folks first, though, in case it's something that's easier to fix on this end (by making ciqueue execute later, somehow).

DRY redis setup in ci-queue tests

See #113 (comment)

Update ciqueue for pytest

ciqueue for pytest got somewhat outdated:

no support for Python >= 3.8 (there's already 3.10)
supports 'pytest>=2.7,<=3.5.0', while there's already pytest 7.1.2

Support multiple `-I` flags in `minitest-queue`

👋 This would be a nice to have, as it would more closely mirror how the ruby cli operates. In the cli I can do something like this:
ruby -Ilib -Itest ./test/some_test
which is equivalent to:
ruby -Ilib:test ./test/some_test

For some command like this:

bundle exec minitest-queue run --queue redis://maestro-service-prototype.railgun  --max-requeues 3 --requeue-tolerance 0.01 --timeout 45 -Ilib -Itest ./test/maestro_service/conductor_test.rb

The gem will only respect the last -I flag. It would be nice to have it allow multiple -I flags, or maybe output some warning if it sees multiple of the same flag.

Q: Smallest unit of work in an RSpec suite

Apologies if this is not the place to ask questions, but I couldn't find a better one (Discussions would be a great fit here).

I have a few questions regarding how ci-queue splits tests in an RSpec test suite.

is the smallest unit of work an RSpec example (i.e. an it block)? In other words, are individual items in the queue just specific examples? Or maybe is it whole example groups (i.e. describe with multiple it blocks inside)?
if so, I guess that you've found it to be more optimal as opposed to having whole spec files be the smallest unit of work. Is that so? In particular, I was thinking that scheduling individual examples might perform worse than scheduling a whole spec file. For example, consider the following scenario:

describe "foo" do
   # a very expensive factory
  let!(:car) { Factory.create(:car) }

  it "works" do
    # some tests...
  end

  it "moves" do
   # some tests...
  end
end

(cont.) In that case above, I assume that, due to Rails' transactional tests, scheduling the two it blocks in different workers might be less optimal than scheduling the whole describe block (or whole file) in a single worker. That's because in the latter case, the expensive factory would be created once, while in the former case it'll be created two times (one in each worker).

In general, I'm interested in any insights you might have regarding the optimal way to split a large RSpec suite, with regards to the "smallest unit of work".

Thanks in advance!

Expire old test runs

Looking at our graphs of redis memory usage, it looks like this gem only removes keys when it hits the memory limit of redis.

This is causing us some problems, as the redis SaaS we are using has a total memory limit for all our instances, and our db for the ci-test-queue using all the memory it can use means other databases have less memory available than they need to operate properly.

I thought I would investigate whether there was a way to clear out old test runs from the ci-queue.
If this isn't currently possible, I'd be happy to put in the work to make it possible. My thought would be to add an optional ttl to the redis commands storing data, but I'd be interested in hearing alternative approaches.

We're using the rubygem version 0.16.0.

Remove Ruby 2.7 support

Support for Ruby 2.7 is ended, so it can be removed from CI and the gem

Feature request: Have the rspec-queue executable output the queue size

Since ci-queue cannot by necessity respect before(:all) hooks, I have a workaround to run the required setup in my CI script. However, I'd like to avoid running it if the queue is empty (i.e. other workers have already exhausted it), so it would be cool to have something like:

unless `rspec-queue --query queue-size`.chomp! == '0'
  # do expensive setup
end

If there's already a straightforward programmatic way to do this, I'm happy to use that instead of adding a feature!

Project root gets greedily stripped from failure backtraces in JUnit Reporter

Looks like this gsub is removing all the project roots to make paths relative, but causes issues when the project root path/name is also within a nested subdirectory:

ci-queue/ruby/lib/minitest/queue/junit_reporter.rb

Line 104 in 9396660

    
           message_with_relative_paths = error.message.gsub("#{Minitest::Queue.project_root}/", '')

Take this example path in a backtrace: components/fooservices/foo/bar.rb. It should be components/foo/app/services/foo/bar.rb. The app/ project root is being removed from within paths when it should just be at the start.

We probably have to iterate over each line in backtrace and only remove it from the start of the line? Or use the relative path helpers.

cc @wvanbergen

Add python support to ci-queue

It would be great if ci-queue supported testing in other languages/frameworks, such as python's pytest.
The current way we get python tests to run tests off a ci queue is to load the queue with tests using the following command:

py.test -q --collect-only

and use pytest_redis to pull the tests off the queue and run them

py.test -vv --junitxml=pytest.xml -p pytest_redis \
           --redis-host=#{redis_host} \
           --redis-port=#{redis_port} \
           --redis-list-key=#{redis_list_key} \
           --redis-backup-list-key=#{redis_backup_list_key}"

cc/ @Shopify/byroot @casperisfine

Do you use this in conjunction with Rails 6 parallel tests?

👋 This gem looks really interesting! Forgive me for asking what might be a dumb question...

I have an application whose test suite I've parallelized in two levels. First, the test files are split up across a dozen or two containers (using circleci tests glob "test/system/**/*_test.rb" | circleci tests split --split-by=timings). Second, each container runs its test using a few parallel processes (via the parallelize method from Rails 6). The second level helps to further reduce the overall runtime and can parallelize individual tests within the largest test files.

This has worked pretty well, but sometimes one container is just really slow to start up so we end up with the job waiting on that one container when all the rest have already finished. Here's a particularly egregious example:

This has led me to look into queueing solutions that would minimize the effects of one slow container, and I happened to find this one, which looks promising. My question is: if I were to adopt this, would I still be able to leverage the second level of Rails' multiple-process parallelism within each container, or are there some incompatibilities that prevent that?

	def build_queue(queue_url, tests_index=None):
	spec = uritools.urisplit(queue_url)
	if spec.scheme == 'list':
	return ciqueue.Static(spec.path.split(':'))
	elif spec.scheme == 'file':
	return ciqueue.File(spec.path)
	elif spec.scheme == 'redis':
	redis_args = parse_redis_args(spec)
	redis_client = redis.StrictRedis(**redis_args)

	worker_args = parse_worker_args(spec.query, tests_index)
	retry = bool(worker_args['retry'])
	del worker_args['retry']

	klass = ciqueue.distributed.Worker
	if tests_index is None:
	klass = ciqueue.distributed.Supervisor
	queue = klass(tests=tests_index, redis=redis_client, **worker_args)
	if retry and tests_index:
	queue = queue.retry_queue()
	return queue
	else:
	raise "Unknown queue scheme: " + repr(spec.scheme)

	success = true
	@configuration.reporter.report(examples_count) do \|reporter\|
	@configuration.add_formatter(BuildStatusRecorder)

	@configuration.with_suite_hooks do
	break if @world.wants_to_quit
	queue.poll do \|example\|
	success &= example.run(QueueReporter.new(reporter, queue, example))
	break if @world.wants_to_quit
	end
	end
	end

	return 0 if @world.non_example_failure
	success ? 0 : @configuration.failure_exit_code