
trombi's Introduction

trombi


Conduct load tests with Clojure, obtaining results as Clojure data or graphical charts. This tool targets stateful applications needing scripted logic in tests.

Installation

Add the following to your project.clj :dependencies:

See the Clojars project page for the latest version coordinates.

This tool used to be named clj-gatling. The project started as a simple wrapper around Gatling. However, the library has since evolved into a separate tool with its own design, and Gatling Highcharts is now only an optional dependency used for generating graphical results. The old name was therefore misleading, and the project was renamed to Trombi (a Finnish word meaning tornado).

Note for clj-gatling users! trombi is backwards compatible and you can still run your existing tests without any code changes. However, trombi no longer includes all the dependencies by default. See the section Additional Dependencies.

Usage

Trivial example

This will make 100 simultaneous HTTP GET requests (using the http-kit library) to a localhost server. A single request is considered ok if the response HTTP status code is 200.

(require '[trombi.core :as trombi])
(require '[org.httpkit.client :as http])

(defn localhost-request [_]
  (let [{:keys [status]} @(http/get "http://localhost")]
    (= status 200)))

(trombi/run
  {:name "Simulation"
   :scenarios [{:name "Localhost test scenario"
                :steps [{:name "Root"
                         :request localhost-request}]}]}
  {:concurrency 100})

Running the simulation prints some statistics to stdout and returns a result like this:

{:ok 90 :ko 10 :response-time {:global {:min 20
                                        :max 500
                                        :mean 154}}}

Here ok means the number of successful requests and ko the number of failed ones. Response times are in milliseconds.

If you want to see a graphical report instead, you can call trombi with additional options and add extra dependencies. (See the sections Reporters and Additional Dependencies for more details.)

(require '[trombi-gatling-highcharts-reporter.core])
(trombi/run your-simulation {:concurrency 100 :reporters [trombi-gatling-highcharts-reporter.core/reporter]})

This call will use the Gatling Highcharts reporter and generate a graphical report. The location of the report is returned and can also be found in the stdout output.

Calling the run function blocks while the simulation is running. If you want more control you can call the run-async function instead. It takes the same parameters as the synchronous call. However, it returns immediately and returns a map with the following keys (a usage sketch follows the list):

  • results: A promise that is delivered once the simulation finishes.
  • force-stop-fn: A function that stops the execution of the simulation. The function does not take any parameters. Stopping does not kill scenarios/requests that are in progress; they will be finished before the exit. When a simulation is force stopped, trombi does not guarantee that the results are reliable, so it is better to ignore the results when you finish a simulation with a force stop.
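
A minimal sketch of the asynchronous API, reusing the simulation map from the trivial example above:

(let [{:keys [results force-stop-fn]}
      (trombi/run-async
        {:name "Simulation"
         :scenarios [{:name "Localhost test scenario"
                      :steps [{:name "Root"
                               :request localhost-request}]}]}
        {:concurrency 100})]
  ;;Call (force-stop-fn) here if you need to stop the simulation early
  @results) ;Dereferencing the promise blocks until the simulation finishes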

Concepts

trombi runs simulations to generate load. A simulation consists of one or multiple scenarios that are run in parallel. One scenario contains one or multiple steps that are run sequentially. A simulation is configured to run with a given number of virtual users or with a given rate of new virtual users per second. As a result the tool returns response times (min, max, average, percentiles) and requests per second. Internally a millisecond is used as the precision for measurements, so the tool is not suited for testing systems with sub-millisecond response times.

A simulation is specified as a Clojure map like this:

{:name "Simulation"
 :pre-hook (fn [ctx] (do-some-setup) (assoc ctx :new-value value)) ;Optional
 :post-hook (fn [ctx] (do-some-teardown)) ;Optional
 :scenarios [{:name "Scenario1"
              :context ;Optional (default {})
              :weight 2 ;Optional (default 1)
              :skip-next-after-failure? false ;Optional (default true)
              :allow-early-termination? true ;Optional (default false)
              :pre-hook (fn [ctx] (scenario-setup) (assoc ctx :new-val value)) ;Optional
              :post-hook (fn [ctx] (scenario-teardown)) ;Optional
              :step-fn ;Optional. Can be used instead of list of steps
              :steps [{:name "Step 1"
                       :request step1-fn}
                      {:name "Step 2"
                       :sleep-before (constantly 500) ;Optional
                       :request step2-fn}]}
             {:name "Scenario2"
              :weight 1
              :steps [{:name "Another step"
                       :request another-step-fn}]}]}

Global simulation hooks

You can define a pre-hook function that is executed once before running the simulation. The function takes in the context map, and you can change the context (e.g. by adding new keys) by returning a new map. You can also define a post-hook function, which is called after the simulation has run.
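
A minimal sketch, where create-test-account!, delete-test-account! and some-request-fn are hypothetical functions:

(def simulation-with-hooks
  {:name "Simulation"
   :pre-hook (fn [ctx]
               (let [account-id (create-test-account!)] ;Hypothetical setup call
                 (assoc ctx :account-id account-id))) ;The returned map becomes the new context
   :post-hook (fn [ctx]
                (delete-test-account! (:account-id ctx))) ;Hypothetical teardown call
   :scenarios [{:name "Scenario"
                :steps [{:name "Step"
                         :request some-request-fn}]}]})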

Scenarios

You can define one or multiple scenarios. Scenarios are always run in parallel. Concurrent users are divided between the scenarios based on their weights. For example:

  • Simulation concurrency: 100
  • Scenario1 with weight 4 => concurrency 80
  • Scenario2 with weight 1 => concurrency 20

Scenario weight is an optional key with a default value of 1. Without explicit weights, the users are split evenly between the scenarios.

Scenarios can also specify their own additional context via the optional :context key.

Scenario steps

Each scenario consists of one or multiple steps. Steps are always run in sequence. A step has a name and a user-specified function (request) that is supposed to call the system under test (SUT). The function takes in the scenario context as a map and has to return either a boolean directly or a core.async channel that will receive a boolean message.

(require '[clojure.core.async :refer [go]])

;;Returning a boolean directly
(defn request-returning-boolean [context]
  ;;Call the system under test here
  true) ;Return true/false based on the result of the call

;;Returning a core.async channel
(defn request-returning-channel [context]
  (go
    ;;Call the system under test here using a non-blocking call
    true)) ;Return true/false based on the result of the non-blocking call

The latter is the recommended approach: it allows trombi to share threads between virtual users and makes it possible to generate more load from one machine. However, the former is probably easier to use at the beginning and is therefore a good starting point when writing your first trombi tests.
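
For instance, here is a sketch of a non-blocking request that uses http-kit's callback arity to deliver the result to a core.async channel instead of blocking on a dereference:

(require '[clojure.core.async :refer [chan put!]]
         '[org.httpkit.client :as http])

(defn localhost-request-async [_]
  (let [result (chan)]
    ;;The callback runs on http-kit's own threads, so no caller thread blocks waiting
    (http/get "http://localhost" {}
              (fn [{:keys [status]}]
                (put! result (= status 200))))
    result))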

If the function returns false, trombi counts the step as failed. If the function throws an exception, it is considered a failure too. trombi also provides a global timeout for steps (see Options). If a request function takes more time than that, it is cancelled and the step is again counted as a failure.

Note! trombi reports only step failures and successes. At the moment there is no support for distinguishing different kinds of errors at the reporting level. All errors are logged to target/results/<sim-name>/errors.log.

The context map contains all the values you specified as the simulation context when calling the run function (see Options), plus a trombi-provided user-id value. The purpose of user-id is to give you a way to specify different functionality for different users. For example:

(defn request-fn [{:keys [user-id]}]
  (go
    (if (odd? user-id)
      (open-document-a)
      (open-document-b))))

If your scenario contains multiple steps, you can also pass values from one step to the next step inside a scenario instance (same user) by returning a tuple instead of a boolean.

;;step 1
(defn login [context]
  (go
    (let [user-name (login-to-system)]
      [true (assoc context :user-name user-name)])))

;;step 2
(defn open-frontpage [context]
  (go
    (open-page-with-name (:user-name context))))

If you don't want a step to launch immediately after the previous step, you can specify the step key sleep-before. The value for that key is a user-defined function that takes in the scenario context and has to return the number of milliseconds to wait before starting the request function for that step.
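
For example, a sketch of a randomized think time (step-fn here stands for your request function):

{:name "Step with think time"
 :sleep-before (fn [ctx] (rand-int 1000)) ;Wait 0-999 ms before the request starts
 :request step-fn}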

By default trombi won't call the next step in a scenario if a previous step fails (returns false). You can override this behaviour by setting skip-next-after-failure? to false at the scenario level.

When trombi terminates the simulation (either after the given duration or the given number of requests), all running scenarios will still finish. If a scenario has multiple steps and takes long to run, the simulation may take some time to fully terminate. If you want to disable this behaviour at the scenario level, you can set allow-early-termination? to true.

Scenario hooks

The scenario pre-hook function is executed before the scenario is run for a single virtual user. The scenario post-hook function is executed after that user has finished the scenario. The post-hook is always executed (even when a previous step fails).

Dynamic scenarios

Sometimes a pre-determined sequence of steps does not provide enough flexibility to express the test scenario. In such a case, you may provide the key step-fn instead, with a function taking the current context and returning a tuple of a step and a (possibly modified) context. Returning a nil step marks the end of the scenario.

Note! If step-fn never returns a nil step, the simulation will run endlessly. To prevent that you can use the option :allow-early-termination?.
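
A minimal sketch of a dynamic scenario, attached to a scenario via :step-fn (login-request is a hypothetical request function):

(defn my-step-fn [context]
  (if (:logged-in? context)
    [nil context] ;A nil step ends the scenario for this user
    [{:name "Login" :request login-request}
     (assoc context :logged-in? true)]))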

Options

The second parameter to the trombi.core/run function is an options map, which can contain the following keys:

{:context {:environment "test"} ;Context that is passed to user defined functions. Defaults to empty map.
 :timeout-in-ms 3000 ;Timeout for a request function. Defaults to 5000.
 :root "/tmp" ;Directory where cl-gatling temporary files and final results are written. Defaults to "target/results".
 :concurrency 100 ;Number of concurrent users trombi tries to use. Default to 1.
 :concurrency-distribution ;Function for defining how the concurrent users are distributed during the simulation. Optional.
 :rate 100 ;Number of new requests to add per second. Note! If rate is given, concurrency value will be ignored.
 :rate-distribution ;Function for defining how the rate is adjusted during the simulation. Optional.
 :progress-tracker ;Function used for tracking simulation progress. Optional.
 :reporters ;List of reporters to use. Optional. If omitted, the short summary reporter is used.
 :requests 10000 ;Total number of requests to run before ending the simulation. Defaults to the number of steps in simulation.
 :duration (java.time.Duration/ofMinutes 5) ;The time to run the simulation. Note! If duration is given, requests value will be ignored.
 :error-file "/tmp/error.log"} ;The file to log errors to. Defaults to "target/results/<sim-name>/error.log".

Ramp-up

If you only set the concurrency, trombi will use the same concurrency from the beginning to the end. If you want more control over that (for example a ramp-up period), you can specify your own concurrency distribution function. The concurrency and rate distribution functions both have a legacy (version < 0.17.0) and a new format.

In legacy mode, when the user-provided function is binary (2-arity), your function will be called with:

  • The progress through the simulation (as defined by either duration or requests), as a floating point number that goes from 0.0 to 1.0.
  • The scenario-level context.

e.g.

(fn [progress context]
   (if (< progress 0.1)
      0.1
      1.0))

In the new mode, the user-provided function should be unary (1-arity). The single argument is a map, which can be destructured at will and allows new arguments to be added without breaking backwards compatibility. Currently, the provided keys are:

  • progress: The percentage progress through the simulation (as defined by either duration or requests), as a floating point number that goes from 0.0 to 1.0.
  • duration: The elapsed time the simulation has been running for.
  • context: The scenario-level context.

e.g.

(fn [{:keys [progress duration context]}]
   (if (< (.toSeconds duration) 10)
      0.1
      1.0))

Your distribution function should return a floating point number from 0.0 to 1.0. The concurrency/rate at that point in time will be the requested concurrency/rate multiplied by the returned number.
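
For example, a sketch wiring a new-format distribution function into the options (your-simulation is again a placeholder):

(trombi/run your-simulation
            {:concurrency 1000
             :concurrency-distribution (fn [{:keys [progress]}]
                                         (min 1.0 (+ 0.1 (* 2.0 progress))))}) ;Ramp from 10% towards full concurrency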

Progress tracker

By default, trombi will write the progress periodically (every 200 milliseconds) to console output.

If you want to disable this functionality you can specify the option :progress-tracker (fn [_]).

The following keys are passed to the progress tracker function:

  • progress: Progress as a floating point number between 0.0 and 1.0.
  • sent-requests: Number of requests sent so far.
  • total-concurrency: How many concurrent requests are in progress at the moment.
  • default-progress-tracker: Function implementing the default behaviour. This can be used to call the default tracker from a user-provided progress tracker.
  • force-stop-fn: A function that stops the execution of the simulation. The function does not take any parameters. Stopping does not kill scenarios/requests that are in progress; they will be finished before the exit.

e.g.

(fn [{:keys [progress sent-requests total-concurrency default-progress-tracker force-stop-fn] :as params}]
  (println "Progress:" progress ", sent requests:" sent-requests ", total concurrency:" total-concurrency)
  (default-progress-tracker params)) ;Call the default behaviour

Tuning parallelism

Internally trombi uses core.async, which has a fixed-size thread pool. For load test scripts that use a high-performance, asynchronous, non-blocking I/O library (e.g. http-kit) this is not a big issue. However, for libraries that require a thread per request (e.g. clj-http) it is a real limitation.

The latest version of core.async supports setting the thread pool size via the system property clojure.core.async.pool-size. With that, the thread pool can be set to match the concurrency used in the simulation.
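
For example, with Leiningen you could set the property in project.clj (a sketch; size the pool to match your simulation's concurrency):

:jvm-opts ["-Dclojure.core.async.pool-size=100"]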

Reporters

By default, trombi generates one report which is a short summary report.

When you call the trombi/run function it returns all the reports under reporter keys (e.g. :short for the short summary reporter). Reporters also print a report to stdout, and some reporters may write results to a file.

If you don't want to use the default reports you can specify a list of reporters with the :reporters key in the options. The available reporters are the following (a usage sketch follows the list):

  • trombi.reporters.short-summary/reporter Returns a summary with the number of successful and failed requests. In addition, the global min, max and mean are reported.
  • trombi.reporters.raw-reporter/in-memory-reporter Returns all the raw results (scenarios & requests with their start and end times), stored in memory.
  • trombi.reporters.raw-reporter/file-reporter Returns all the raw results (scenarios & requests with their start and end times), stored in a file.
  • trombi-gatling-highcharts-reporter.core/reporter Generates a Gatling Highcharts html report (see Additional Dependencies).
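
For example, a sketch using the in-memory raw reporter (your-simulation is a placeholder):

(require '[trombi.reporters.raw-reporter :as raw])

(trombi/run your-simulation {:concurrency 10
                             :reporters [raw/in-memory-reporter]})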

You can also specify your own custom reporter. Check https://github.com/mhjort/trombi/blob/master/src/trombi/reporters/raw_reporter.clj as an example.

Additional dependencies

Trombi uses only a few libraries by default. However, it supports some features that may need additional dependencies.

If you want to use the Gatling Highcharts reporter, you have to include the dependency [com.github.mhjort/trombi-gatling-highcharts-reporter "1.0.0"].

If you want to use http-kit as your http client library you have to include the dependency [http-kit "2.6.0"].

Note that the clj-time dependency is no longer included in trombi, because from Java SE 8 onward users are asked to migrate to java.time (JSR-310). However, trombi is backwards compatible, and if you still want to use clj-time, add [clj-time "0.15.2"] as a dependency.

Examples

See example project here: metrics-simulation

Here is a presentation on how to test stateful applications with Trombi

Tuning the test runner

In load testing the goal is to generate load against the system under test. However, sometimes the test runner itself can be the bottleneck. trombi has been built on the idea that request functions should be non-blocking: that way the test runner does not need many threads and it is possible to generate a huge number of requests from a single machine. To track this behaviour there is now experimental support for tracking the active thread count in the test client. By setting :experimental-test-runner-stats? true you can get statistics about the thread count during the simulation (see the sketch after the sample output below). At the end trombi will print the following output to the console:

Test runner statistics: {:active-thread-count {:average 30, :max 33}}

In general these numbers should be lower than the concurrency of the simulation, and when you increase the concurrency these numbers should not grow proportionally.
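
A sketch enabling the experimental statistics (your-simulation is a placeholder):

(trombi/run your-simulation {:concurrency 100
                             :experimental-test-runner-stats? true})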

Change History

Note! Version 0.8.0 introduced a few changes to how simulations are defined. See the details in the change history. All changes are backwards compatible; the old way of defining a simulation is still supported but may be deprecated in the future. You can see documentation for old versions here.

Jenkins

This is compatible with the Jenkins Gatling Plugin: https://wiki.jenkins.io/display/JENKINS/Gatling+Plugin#GatlingPlugin-Configuration

Why

As far as I know, there is no other load testing tool where you can specify your tests in Clojure and which is suitable for testing large stateful applications. In my opinion the Clojure syntax is very well suited for this purpose.

Design Philosophy

Real life scenarios

The idea is to test the system/application by simulating how multiple users access it concurrently. After the simulation you get comprehensive results: both as Clojure data and, if you want, as nice graphs.

If you just want to test a single web page with a huge number of requests, simpler tools like Apache Benchmark can do that. Of course, trombi can do that too, but it might be a bit of an overkill for the job.

No DSL

I am not a fan of complex DSLs, and trombi tries to avoid the DSL approach. The idea is that you write just ordinary Clojure code to specify the actions, and the actual scenario definition is a map.

Distributed load testing

Clojider is an experimental tool that can run trombi in a distributed manner. It uses AWS Lambda for running distributed load tests in the cloud.

Contribute

Use GitHub issues and Pull Requests.

Development

By default Leiningen is used for development. There is also experimental support for deps.edn (See Makefile).

License

Copyright (C) 2014-2023 Markus Hjort

Distributed under the Eclipse Public License, the same as Clojure.

trombi's People

Contributors

andreacrotti, dancmeyers, greywolve, jvirtanen, kalouantonis, mhjort, piguy79


trombi's Issues

Passing users list to core/run

simulation/run accepts users as an option, but core/run does not pass it. Is this just an oversight or a design choice?

Use non-blocking version of alts!! when running scenarios constantly

When running scenarios constantly, clj-gatling uses this:

(defn- collect-result-and-run-next [cs run]
  (let [[result c] (async/alts!! cs)
        ret @result]
    (go (>! c (run)))
    ret))

The blocking version of alts!! is not optimal for this case, and because of it clj-gatling cannot generate a very high constant load. This part should be reimplemented in a non-blocking way.

Probable mistake in readme.md at `(assoc :user-name context)`

Maybe I am wrong, but the following seems weird to me:

;step 1
(defn login [context]
  (go
    (let [user-name (login-to-system)]
      [true (assoc :user-name context)])))

shouldn't it be:

;step 1
(defn login [context]
  (go
    (let [user-name (login-to-system)]
      [true (assoc context :user-name user-name)])))

?

End simulation when total sent requests equal given number of requests

The current implementation splits the total number of requests between scenarios, for example 66 requests for scenario1 and 34 for scenario2. When running that kind of simulation, scenario2 often finishes a lot earlier and the simulation then continues with only one scenario. That's rarely what the user wants.

The better way is to just split the concurrency between scenarios and then run them both until the total number of requests for the simulation has been fulfilled.

Support for longer running load test scenarios

At the moment clj-gatling reads all the intermediate results into memory. When the actual load test run is over, it writes the results to event log files. The Gatling parser reads those files and generates the final reports.

For short-running scenarios (less than an hour) this model works pretty nicely. However, if you run your test for more than an hour, clj-gatling consumes gigabytes of memory and eventually you will get an OutOfMemoryError.

Gatling should support parsing multiple event log files. So we should batch the results into multiple event log files while the load test is running: say, after 1 million items, write them to a file and free the memory.

Clj-gatling generates "hidden" requests that are not shown in report

It was easier to implement the load test run in a way that there are core.async go-loops that just make requests all the time. When either the given duration or the total number of requests has passed, we stop the run by not reading any more responses. This leads to a situation where there are sometimes a few requests that get cancelled; we do not check their responses any more and they do not end up in the report.

Originally I thought this was such a small issue (like 100000 requests and a few extra ones that are not handled) that it does not matter. However, I've noticed that in some cases I want to know exactly how many requests were sent. For example, right now I am performance testing our analytics solution: I want to send exactly 100000 requests, check the performance and verify that our analytics backend has exactly 100000 events.

Allow for retries of scenario steps

It would be nice to be able to specify scenario steps to be retried at the scenario level and keep track of the failures. The method I'm currently using does a retry inside the called function, but then it's not apparent in the result set that there was a failure and a retry happened; it just looks like a really long request.

Kill a simulation?

If I run a long (and possibly buggy) simulation from the REPL, I can't really kill it without restarting the REPL entirely.

I think ideally sending a cider-interrupt or equivalent should kill the whole thing, which I guess doesn't work out of the box because it's heavily using core.async under the hood, right?

Configurable ramp-up for concurrency

At the moment clj-gatling starts the simulation with a given number of concurrent users and runs the simulation constantly with the same number of users. In load testing it's often better to ramp up users within a given time period. For example, Gatling provides a huge set of configurable algorithms for specifying the ramp-up.

The clj-gatling philosophy is not to have a DSL but instead to give the user control to specify her own algorithm. My proposal is the following: instead of configuring the number of concurrent users as an integer, the concurrency will be a function that gets the progress as a parameter and returns the number of users. Example:

Constant load with 100 users

:concurrency
  (fn [_] 100)

Ramp up (first 10, then 50 and finally 100 users)

:concurrency
  (fn [progress-percentage]
    (cond 
      (< progress-percentage 5) 10
      (< progress-percentage 10) 50
      :else 100))

Does this make sense?

IndexOutOfBoundsException when :concurrency is 1800+

Runs ok with :concurrency 1700 or less. With :concurrency 1800 or above it throws on startup:

Exception in thread "main" java.lang.IndexOutOfBoundsException
	at clojure.lang.PersistentVector.assocN(PersistentVector.java:188)
	at clojure.lang.PersistentVector.assocN(PersistentVector.java:22)
	at clojure.lang.APersistentVector.assoc(APersistentVector.java:343)
	at clojure.lang.APersistentVector.assoc(APersistentVector.java:18)
	at clojure.lang.RT.assoc(RT.java:792)
	at clojure.core$assoc__4371.invokeStatic(core.clj:191)
	at clojure.core$update.invokeStatic(core.clj:5960)
	at clojure.core$update.invoke(core.clj:5952)
	at clj_gatling.simulation_util$split_to_buckets_with_sizes$fn__8097.invoke(simulation_util.clj:52)
	at clojure.lang.LongRange.reduce(LongRange.java:233)
	at clojure.core$reduce.invokeStatic(core.clj:6544)
	at clojure.core$reduce.invoke(core.clj:6527)
	at clj_gatling.simulation_util$split_to_buckets_with_sizes.invokeStatic(simulation_util.clj:51)
	at clj_gatling.simulation_util$split_to_buckets_with_sizes.invoke(simulation_util.clj:50)
	at clj_gatling.simulation_util$weighted_scenarios.invokeStatic(simulation_util.clj:77)
	at clj_gatling.simulation_util$weighted_scenarios.invoke(simulation_util.clj:68)
	at clj_gatling.simulation$run.invokeStatic(simulation.clj:153)
	at clj_gatling.simulation$run.invoke(simulation.clj:142)
	at clj_gatling.core$run.invokeStatic(core.clj:52)
	at clj_gatling.core$run.invoke(core.clj:44)
	at net.company.project.load.core$_main.invokeStatic(core.clj:24)
	at net.company.project.load.core$_main.invoke(core.clj:9)
	at clojure.lang.AFn.applyToHelper(AFn.java:160)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at net.company.project.load.core.main(Unknown Source)

(Thanks for the Clojure wrapper around Gatling!)

More accurate response time

Just throwing my notice/idea out. Please feel free to ignore it if it's not aligned with your clj-gatling direction.

It seems like clj-gatling reports response time by timing the whole time spent in each step function. I think this could be hugely misleading, as the time may be spent more in test logistics than in the system under test, which is what we actually care about.

Some examples of the things I'm doing inside the step function:

  • the http client parses things such as headers and cookies
  • set cookie for the next step
  • get cookie from the previous step
  • parse body to get form token

Have you had any thoughts about this? I'm personally not sure how we could fix this situation. One idea is to provide two hooks for starting/stopping the timer, and then hook that into the thing that exercises the system under test, such as the http client.

Thanks

Grace period between steps in a simulation

To simulate real usage by real users, it would be nice to be able to introduce some sort of grace period between functions in the simulation:

(gat/run-simulation
   [{:name "my stress test"
     :requests [{:name "welcome-page" :fn welcome-page }
                {:sleep :period 10} ; user waits 10 seconds before accessing the login page
                {:name "login-page" :fn login-page }]}] 1)

I guess one could also imagine something like:

(gat/run-simulation
   [{:name "my stress test"
     :requests [{:name "welcome-page" :fn welcome-page :sleep-after 10 } ; sleep 10 secs after this request
                {:name "login-page" :fn login-page :sleep-before 10 } ; sleep 10 secs before this request
]}] 1)

Skip adding :user-id on each step

I think it is better to add :user-id once in the function run-scenario-once, before going into the go-loop, instead of trying to do it in the function async-function-with-timeout -->> original-context-with-user (assoc original-context :user-id user-id) on each step of the scenario.

Add instance-id per running instance of each scenario

Since user-id is limited to the number of concurrent users, having an instance-id per running instance of each scenario could be helpful for picking the next test data item from input data sets. It could be used in data feeder implementations. The instance-ids of running instances of different scenarios can all start from zero.
The following table shows context values for two scenarios A and B with concurrency 3:

Scenario A instance context    Scenario B instance context
instance-id: 0, user-id: 0     instance-id: 0, user-id: 1
instance-id: 1, user-id: 2     instance-id: 1, user-id: 0
instance-id: 2, user-id: 1     instance-id: 2, user-id: 2
instance-id: 3, user-id: 0     instance-id: 3, user-id: 1
...                            ...

Furthermore, being able to set an instance-id-strategy in the simulation, for specifying how instance-ids are generated, would be an extra helpful feature. In other words, to choose whether instance-ids are created sequentially regardless of which scenario they belong to, or are unique per scenario.

The following table shows context values for two scenarios A and B with concurrency 3:

Scenario A instance context    Scenario B instance context
instance-id: 0, user-id: 0     instance-id: 1, user-id: 1
instance-id: 2, user-id: 2     instance-id: 3, user-id: 0
instance-id: 4, user-id: 1     instance-id: 5, user-id: 2
instance-id: 6, user-id: 0     instance-id: 7, user-id: 1
...                            ...

java.lang.StackOverflowError for over 2500 concurrent users

I'm trying to generate heavy load for a backend system but unfortunately I was hit by java.lang.StackOverflowError. Reproducing this bug is super easy with your clj-gatling-example project.

$ lein run metrics 2800 1
Exception in thread "main" java.lang.StackOverflowError, compiling:(/private/var/folders/f5/k0cgb1rs0hldwslr_l5hlpkc0000gn/T/form-init8067742416466387465.clj:1:125)
	at clojure.lang.Compiler.load(Compiler.java:7391)
	at clojure.lang.Compiler.loadFile(Compiler.java:7317)
	at clojure.main$load_script.invokeStatic(main.clj:275)
	at clojure.main$init_opt.invokeStatic(main.clj:277)
	at clojure.main$init_opt.invoke(main.clj:277)
	at clojure.main$initialize.invokeStatic(main.clj:308)
	at clojure.main$null_opt.invokeStatic(main.clj:342)
	at clojure.main$null_opt.invoke(main.clj:339)
	at clojure.main$main.invokeStatic(main.clj:421)
	at clojure.main$main.doInvoke(main.clj:384)
	at clojure.lang.RestFn.invoke(RestFn.java:421)
	at clojure.lang.Var.invoke(Var.java:383)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.Var.applyTo(Var.java:700)
	at clojure.main.main(main.java:37)
Caused by: java.lang.StackOverflowError
	at clojure.core$seq__4357.invokeStatic(core.clj:137)
	at clojure.core$map$fn__4789.invoke(core.clj:2648)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:521)
	at clojure.core$seq__4357.invokeStatic(core.clj:137)
	at clojure.core$map$fn__4789.invoke(core.clj:2648)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:521)
	at clojure.core$seq__4357.invokeStatic(core.clj:137)
	at clojure.core$map$fn__4789.invoke(core.clj:2648)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.RT.seq(RT.java:521)
...

Do you have any clues what the core problem is here? I'd like to help.

dynamic scenarios

First, nice work with clj-gatling! As I understand it, scenarios are defined as a fixed sequence of steps at the start. How would one go about creating more dynamic scenarios where the next step is determined by the result of a request? It would be useful if steps could return the next steps to take.

Allow for initial context to be read from scenario directly

So that request functions can be written in a reusable fashion.

E.g.

:scenarios [{:name "..." :steps [...] :context {:product-slug "my-product-slug"}}]

(defn product-request [{:keys [product-slug] :as context}]
   ... use product-slug to produce a url for request ...)

Pre and Post hooks for simulations

In case this exists and it's a documentation issue: I'm finding several simulation settings where I want to configure or set up accounts on the server before running my scenarios, and then reset the server afterwards for repeatability. To properly test a mix of workloads against a single account, it's helpful to do this outside of a specific scenario, so that the scenarios, while operating more or less independently, operate on a common set of data set up by the pre-hook. e.g.

(def simulation
   {:name "My Simulation"
    :pre-hook setup-fn
    :scenarios  [ ... ]
    :post-hook cleanup-fn})

Problems with Ring and http-kit versions

I added this to my dependencies. Before writing any code that actually invokes the library, compilation started failing.

Caused by: java.lang.NoSuchMethodError: clojure.lang.Reflector.invokeNoArgInstanceMember(Ljava/lang/Object;Ljava/lang/String;Z)Ljava/lang/Object;
    at clout.core$route_compile$fn__685.invoke(core.clj:134)

I have not been able to pinpoint the specific combination of library versions that is required. This is related to the Ring and http-kit versions. It seems to work fine if you can use the same versions that clj-gatling uses, but it may require much tweaking if other versions are required.

Improve performance by configuring thread pool size based on concurrency

clj-gatling uses core.async version 0.2.374, which has a "fixed" size thread pool. The size is based on the number of cores the machine has, and there is no way for the user to increase it. For load tests that use a high-performance asynchronous non-blocking IO library (e.g. http-kit) this is not a big issue. However, for libraries that require a thread per request (e.g. clj-http) this is a real limitation.

The latest version of core.async supports setting the thread pool size using a system property. With that, the thread pool could be set to match the concurrency used in the simulation.

Support simulations with callback vs. just return values

I have a web system that I want to load test and evaluate for its end-to-end performance, where the request starts off a process but the actual transaction is completed only when the system POSTs to a URL endpoint. We're interested in getting a sense of both latency and throughput for the GET -> POST response transaction. Is there a clean way to implement this, or do you have suggestions for a patch I could submit? We've been looking at doing something more ad hoc on top of Riemann, but I'm intrigued by your approach here.

Allow for early termination and partial data collection

Running long stress tests with clj-gatling is rather opaque; furthermore, I have had tests run far, far longer than the duration I specified. I would like the option to terminate the process early and to have some insight into how the run is going without having to wait until the end. Perhaps the call to run the simulation could return immediately with a function that can be called to terminate the simulation and an atom that is periodically updated with partial information?

Feeding and templating

Hello, is there any feature for custom feeding?
Does this library support any templating, like mustache, based on the feeding data?

Steps running serially?

From what I understood from the docs, steps in a given scenario run serially, one after the other.
I'm seeing that if I add an extra step the requests/sec more or less halves, which would suggest that they are actually being run in parallel and I'm maxing out on the local concurrency.

In case it helps, I'm using clj-http and I tweaked clojure.core.async.pool-size; my other attempts (using core.async and the async true option) failed, but that tweak at least seems to do the job.

How to identify concurrency issues

Kind of related to #67.
I was wondering if you have hints about how to detect whether the results (like reqs/sec) you get are actually the maximum your server can deal with, or whether the bottleneck is just the way you run the test.

I guess it's not easy to find out. For example, using clj-http, until I saw the clojure.core.async.pool-size mention I was getting wrong results, but since I didn't actually know what maximum my server was capable of, I didn't realize it.

Maybe clj-gatling could somehow detect if you're doing something silly and alert you?

Option for stopping simulation either after running scenario fully or after each step

clj-gatling checks after running a full user scenario whether it should continue running the simulation. The condition for this is either the max number of requests or the given duration. This way the final simulation always contains full scenarios. However, sometimes having a scenario with multiple steps can cause problems: if the given duration has already passed after the first step, clj-gatling still keeps running all the other steps.

The solution is to add a simulation option (true/false) for specifying in which mode you want clj-gatling to run: either check the stop condition after each step, or after a full scenario.

Option to provide custom measurements and errors to reporter

Hello and thanks for this great lib 👍

My problem with it is that it measures the time the :request function took to return true, and in my case that is not exactly what I want to measure, because what my :request function does is:

(defn api-call [{:keys [my-api-call-fn body] :as context}]
  (let [params (check-body-with-schema body) ;checks some params before the actual request
        {:keys [status actual-request-time body]} (my-api-call-fn params)] ;constructs and sends the actual request
    (if (and (= status 200)
             (convert-json-to-edn-and-check-it-with-clojure-specs body)) ;checks the response body
      true
      false))) ;and only after all that it returns

But what I want to measure is actual-request-time (or ideally request and response time), so it would be great if I were able to supply those response times to the gatling reporter by returning some map, or a key by which it can then look up those values in the context map.

The same goes for errors: supplying those to the reporter would be awesome.

I hope I was clear and thanks for your response in advance.

Issues with high concurrency

I'm currently performing tests setting the concurrency to about 10k.
However, I'm getting inconsistent results; please correct me if I'm misunderstanding something.

I currently have only one scenario with 6 steps. My intention is to perform 60k requests in the sequential manner defined by the steps, since each request depends on the result of the one before. However, the requests don't seem to be evenly distributed, and I don't quite get why: in one test the first step was executed 8.5k times and in another test 35k times, more than half of the total requests.
Is there something I'm missing as to why the test behaves this way? The only argument I'm setting is the concurrency, nothing else, other than passing context between steps.

Return some sort of results map in core/run-simulation

Hi there!

First of all thanks for writing this tool! Simple and lightweight =)

We are doing some experiments on my current project, and one thing we felt was missing was the ability to fail the performance run if any (or some arbitrary number of) requests failed.

We are currently re-parsing the simulation log to do that, but it would be cool if core/run-simulation returned some sort of results map, so we can easily do any post-processing we want in Clojure. I am more than willing to write the pull requests for this, but I would love to discuss and validate the idea first.

clj-time

I noticed that this project requires and uses clj-time, which is deprecated in favour of the Java time API from Java 8 onwards.
https://github.com/clj-time/clj-time

It would be great to accept the Java API object, at least in the scenario configuration (for the duration), instead of a Joda-Time object.

Log timeouts

If a step function times out, clj-gatling marks it as failed. It is hard to distinguish timeouts from other errors; clj-gatling should log timeouts somehow.

How to run a test for specific duration in Gatling 2.2

I have tried .during (x seconds) {

}
and also tried
setUp(scn.inject(rampUsers(200) over (100 seconds), during (560 seconds))).protocols(httpProtocol)

but none of the above options work. However, if I remove the duration clause and just run it with a 100 seconds ramp, the test runs fine.

I searched like the entire internet but didn't find any hint or solution on this.
Also, I am using a CSV feeder for the username and password and kept the data extraction circular for the CSV, but still I am getting an error some time into the test saying "no attribute found for username".
The CSV file looks like this; I don't know why gatling has such a weird format for CSV rather than keeping it simple like jmeter.
username,password
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee
j2ee,j2ee

I had kept the CSV like below at first, but it didn't work; then I found the above sample somewhere, which I used.
username,password
j2ee,j2ee

My code is
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._
import scala.concurrent.duration._

class LoginRecordedSimulation extends Simulation {

val httpProtocol = http
    .baseURL("http://localhost:8080")
    .inferHtmlResources()
    .acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    .acceptEncodingHeader("gzip, deflate, sdch")
    .acceptLanguageHeader("en-US,en;q=0.8")
    .userAgentHeader("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")

val headers_0 = Map("Upgrade-Insecure-Requests" -> "1")

val headers_2 = Map(
    "Accept-Encoding" -> "gzip, deflate",
    "Origin" -> "http://localhost:8080",
    "Upgrade-Insecure-Requests" -> "1")

val headers_3 = Map("Accept" -> "image/webp,image/*,*/*;q=0.8")

val csvfeeder = csv("username.csv").circular

val uri1 = "http://localhost:8080/jpetstore"



val scn = scenario("LoginRecordedSimulation")
.during (560 seconds)
{

    .exec(http("request_0")
        .get("/jpetstore")
        .headers(headers_0))
    .pause(8)
    .exec(http("request_1")
        .get("/jpetstore/shop/signonForm.shtml;jsessionid=5C5EC505BC93B2756A9F83F0308B8E61")
        .headers(headers_0))
    .pause(5)
    .exec(http("request_2")
        .post("/jpetstore/shop/signon.shtml")
        .headers(headers_2)
        .formParam("username", "${username}")
        .formParam("password", "${password}")
        .formParam("submit", "Login")
        .resources(http("request_3")
        .get("/jpetstore/images/banner_dogs.gif")
        .check(regex("Welcome").exists)
        .headers(headers_3)))
    .pause(6)
    .exec(http("request_4")
        .get("/jpetstore/shop/viewCategory.shtml?categoryId=DOGS")
        .headers(headers_0))
    .pause(2)
    .exec(http("request_5")
        .get("/jpetstore/shop/viewProduct.shtml?productId=K9-BD-01")
        .headers(headers_0))
    .pause(12)
    .exec(http("request_6")
        .get("/jpetstore/shop/viewItem.shtml?itemId=EST-6")
        .headers(headers_0)
        .resources(http("request_7")
        .get("/jpetstore/images/dog2.gif")
        .headers(headers_3)))
    .pause(6)
    .exec(http("request_8")
        .get("/jpetstore/shop/addItemToCart.shtml?workingItemId=EST-6")
        .headers(headers_0))
    .pause(8)
    .exec(http("request_9")
        .get("/jpetstore/shop/removeItemFromCart.shtml?workingItemId=EST-6")
        .headers(headers_0))
    .pause(7)
    .exec(http("request_10")
        .get("/jpetstore/shop/signoff.shtml")
        .headers(headers_0))
}
setUp(scn.inject(rampUsers(200) over (100 seconds))).protocols(httpProtocol)

}

A more involved tutorial

I'd really like to see a more involved example where the result from one request is used either as input for the next request or as a way to decide the next request.

Example:
I'd like to performance-test my web-shop application through the following scenario (assuming each step has a separate URL):

  1. log in
  2. if login ok, display list of products
  3. select a random product from that list and put it in my shopping basket
  4. checkout

And I want timing on all these endpoints.

Great talk yesterday :)

Add raw reporter

It would be nice if clj-gatling had the ability to return the raw data as EDN.

One hack to do that could be to parse stats.js, which contains the data you want in a format like:

stats: {
    "name": "Global Information",
    "numberOfRequests": {
        "total": "10000",
        "ok": "10000",
        "ko": "0"
    },
    "minResponseTime": {
...

which works, but it's a massive hack.
This reporter could also be a separate library at first and be integrated into clj-gatling if it proves useful to everyone.

Support for more fine detailed response and request times

The default reporter measures how long it takes to execute the request function. If your request function is an http request, that measurement also includes the time the request was made, the response was parsed, etc. Many load testing tools give you more detailed values for those cases, for example TCP connect time, DNS resolution time, SSL handshake duration, etc.

Now that there is support for custom reporters, you can implement those things by making the measurements in the request function and then adding new keys to the context. Those values are then available to a custom reporter. However, doing that is quite a lot of work.

People have been asking about this feature. Therefore it would be good to at least investigate how this could be done and whether it would be possible to use this info in the default reporter (which is still Gatling Highcharts).

Graphite API

Do you think it would be an idea to create a Graphite API so data could be pumped into a datastore and visualised?

An issue with async/timeout

An issue is raised when the concurrency is high and the concurrency distribution leaves enough users waiting 200 milliseconds before checking again whether they should run. As the implementation of async/timeout uses a timeouts map, there is a case where one timeout channel instance is used for an async take multiple times. Then an error is raised:
Exception in thread "async-dispatch-216" java.lang.AssertionError: Assert failed: No more than 1024 pending takes are allowed on a single channel.
(< (.size takes) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.take_BANG_(channels.clj:235)
at clojure.core.async.impl.ioc_macros$take_BANG_.invokeStatic(ioc_macros.clj:983)
at clojure.core.async.impl.ioc_macros$take_BANG_.invoke(ioc_macros.clj:982)
at clj_gatling.simulation$run_scenario_constantly$fn__818$state_machine__5041__auto____847$fn__849.invoke(simulation.clj:116)
at clj_gatling.simulation$run_scenario_constantly$fn__818$state_machine__5041__auto____847.invoke(simulation.clj:109)
at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:973)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:972)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:977)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:975)
at clojure.core.async.impl.ioc_macros$take_BANG_$fn__5059.invoke(ioc_macros.clj:986)
at clojure.core.async.impl.channels.ManyToManyChannel$fn__735.invoke(channels.clj:265)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

My test scenario is executed with concurrency 10k and a concurrency-distribution of 0.1. When I remove the distribution, the problem with the timeout disappears, because no user has to wait.
