# Stress tests
## The need for stress tests
When writing a stateful system that interacts with other processes/components, it is possible to
introduce bugs that may not manifest until the system is put under load. These bugs arise from
incorrect assumptions, off-by-one errors or subtle timing issues. The errors are hard to detect in
advance because they are the result of complex, stateful interactions between layers of the system.
Stress tests reveal errors in these interactions by running large workloads against the system
concurrently. If a workload produces an unexpected error, developers can root-cause the issue,
fix it, and write a test that covers that specific interaction.
Our infrastructure has a dedicated builder, `core.x64-stress`, for running stress tests. This
builder runs each stress test for a maximum of 22 hours.
Note: Stress tests are currently restricted to the `QEMU` device type, since they run for long
periods of time.
## Client Libraries
### Rust stress test library
We offer a Rust client library for writing stress tests. See the
[guide to writing tests with the library](rust_stress_test_library.md).
#### Concepts
Developers can write stress tests that create an environment and produce actors
that indefinitely run randomized workloads on that environment.
##### Actor
An actor's responsibility is to perform operations that stress an environment. It can do this in
one of two ways:
* Interacting cooperatively with the system-under-test (workload generation).
* Intentionally breaking the system-under-test (failure simulation).
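The two kinds of actor can be sketched as follows. This is a minimal, self-contained illustration and not the library's actual API: `FakeStore`, `OpError`, the `Actor` trait and both actor types are hypothetical names invented for this example, and the workload is deterministic where a real actor would randomize it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the system-under-test: an in-memory store that
/// rejects every operation once it has been "crashed".
#[derive(Default)]
pub struct FakeStore {
    data: HashMap<u64, u64>,
    crashed: bool,
}

/// Hypothetical error type an actor can return from an operation.
#[derive(Debug, PartialEq)]
pub enum OpError {
    /// The system-under-test is gone, so the actor asks for a reset.
    ResetEnvironment,
}

/// Hypothetical actor interface: each call performs exactly one operation.
pub trait Actor {
    fn perform(&mut self, store: &mut FakeStore) -> Result<(), OpError>;
}

/// Workload generation: cooperates with the store by writing a value and
/// reading it back.
pub struct WriterActor {
    pub next_key: u64,
}

impl Actor for WriterActor {
    fn perform(&mut self, store: &mut FakeStore) -> Result<(), OpError> {
        if store.crashed {
            return Err(OpError::ResetEnvironment);
        }
        let value = self.next_key * 2;
        store.data.insert(self.next_key, value);
        assert_eq!(store.data.get(&self.next_key), Some(&value));
        self.next_key += 1;
        Ok(())
    }
}

/// Failure simulation: deliberately breaks the store.
pub struct CrashActor;

impl Actor for CrashActor {
    fn perform(&mut self, store: &mut FakeStore) -> Result<(), OpError> {
        store.crashed = true;
        Ok(())
    }
}
```

In this sketch, the writer succeeds until the crash actor has run. After that, the writer's next operation returns `OpError::ResetEnvironment`, which is how it signals that the environment needs to be reset.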
##### Actor Runner
An actor runner is a thin wrapper around an actor and is responsible for running the actor indefinitely.
The environment creates one runner per actor, and each runner runs on its own thread.
A runner repeatedly instructs its actor to perform a single operation and return the result.
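The runner loop might look roughly like the sketch below. It uses only the Rust standard library and is an assumption-laden illustration, not the library's implementation: `Actor`, `ActorRunner`, `BoundedActor` and `spawn_bounded` are hypothetical names, and the `stop` flag exists only so that the example terminates. In the framework, the environment's exit criteria decide when the test ends.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical actor interface: each call performs exactly one operation.
pub trait Actor: Send + 'static {
    fn perform(&mut self) -> Result<(), String>;
}

/// Thin wrapper that owns one actor and drives it on its own thread.
pub struct ActorRunner<A: Actor> {
    name: String,
    actor: A,
}

impl<A: Actor> ActorRunner<A> {
    pub fn new(name: &str, actor: A) -> Self {
        Self { name: name.to_string(), actor }
    }

    /// Spawns a thread that keeps asking the actor for one operation at a
    /// time until `stop` is set. The handle yields the number of operations
    /// that succeeded.
    pub fn run(mut self, stop: Arc<AtomicBool>) -> JoinHandle<u64> {
        thread::spawn(move || {
            let mut completed = 0;
            while !stop.load(Ordering::SeqCst) {
                match self.actor.perform() {
                    Ok(()) => completed += 1,
                    Err(err) => eprintln!("{}: operation failed: {}", self.name, err),
                }
            }
            completed
        })
    }
}

/// Demo actor that sets the stop flag after a fixed number of operations.
pub struct BoundedActor {
    pub remaining: u64,
    pub stop: Arc<AtomicBool>,
}

impl Actor for BoundedActor {
    fn perform(&mut self) -> Result<(), String> {
        self.remaining = self.remaining.saturating_sub(1);
        if self.remaining == 0 {
            self.stop.store(true, Ordering::SeqCst);
        }
        Ok(())
    }
}

/// Convenience helper for the demo: one runner, one thread, one stop flag.
pub fn spawn_bounded(name: &str, limit: u64) -> JoinHandle<u64> {
    let stop = Arc::new(AtomicBool::new(false));
    let actor = BoundedActor { remaining: limit, stop: stop.clone() };
    ActorRunner::new(name, actor).run(stop)
}
```

Calling `spawn_bounded("writer", 5).join()` runs five operations on a dedicated thread and then returns. Giving each runner its own thread means a slow or blocked actor does not stall the others.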
##### Environment
An environment's responsibility is to:
* Store global state needed for the test.
* Provide the exit criteria for the test.
* Provide actors that run for the entire duration of the test.
* Reset global state when requested by an actor.
A stress test creates exactly one environment that lives for the entire duration of the test. A test
writer must define an environment and provide it to the stress test framework.
An environment can store global objects that outlive every instance of the system-under-test.
For example, if a filesystem test operates on a block device backed by a VMO, store that VMO in the
environment, so that even if an actor crashes the block device, the VMO remains intact and can be
used to reset the state.
The environment provides multiple runners to the framework, each containing an actor
that can perform operations.
On reset, an environment must update global state and the connections of each actor.
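The responsibilities above can be sketched in a few lines of plain Rust. Everything here is hypothetical and simplified: `Backing` plays the role of the VMO, `Device` stands in for the system-under-test, and the `Environment` trait, its method names and the `1_000` operation target are assumptions made for illustration, not the library's API. For brevity the environment hands out actors directly, without wrapping them in runners.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for a VMO: raw bytes that outlive any instance of the device.
pub type Backing = Arc<Mutex<Vec<u8>>>;

/// Stand-in for the system-under-test, rebuilt on every reset.
pub struct Device {
    backing: Backing,
    pub instance: u32,
}

/// Each actor holds a connection to the current device instance.
pub struct Actor {
    pub name: &'static str,
    pub device: Arc<Device>,
}

impl Actor {
    pub fn write_byte(&self, offset: usize, value: u8) {
        self.device.backing.lock().unwrap()[offset] = value;
    }
}

/// Hypothetical environment interface.
pub trait Environment {
    /// Exit criteria: stop after this many operations, or never if `None`.
    fn target_operations(&self) -> Option<u64>;
    /// Actors that live for the entire duration of the test.
    fn actors(&mut self) -> &mut Vec<Actor>;
    /// Rebuild the system-under-test and reconnect every actor to it.
    fn reset(&mut self);
}

pub struct BlockEnvironment {
    backing: Backing,
    instance: u32,
    actors: Vec<Actor>,
}

impl BlockEnvironment {
    pub fn new(size: usize, actor_names: &[&'static str]) -> Self {
        let backing: Backing = Arc::new(Mutex::new(vec![0; size]));
        let device = Arc::new(Device { backing: backing.clone(), instance: 0 });
        let actors = actor_names
            .iter()
            .map(|&name| Actor { name, device: device.clone() })
            .collect();
        Self { backing, instance: 0, actors }
    }

    pub fn read_byte(&self, offset: usize) -> u8 {
        self.backing.lock().unwrap()[offset]
    }
}

impl Environment for BlockEnvironment {
    fn target_operations(&self) -> Option<u64> {
        Some(1_000)
    }

    fn actors(&mut self) -> &mut Vec<Actor> {
        &mut self.actors
    }

    fn reset(&mut self) {
        // The backing bytes are reused; only the device instance is replaced.
        self.instance += 1;
        let device = Arc::new(Device {
            backing: self.backing.clone(),
            instance: self.instance,
        });
        for actor in &mut self.actors {
            actor.device = device.clone();
        }
    }
}
```

Because the backing bytes are owned by the environment and not by the device, data written before a reset is still readable afterwards, and every actor ends up connected to the new device instance.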