Rust stress test library

This document explains how to write a stress test using the rust stress test library. The library can be found at //src/sys/lib/stress-test. It implements the test loop and the concurrency and synchronization primitives required for running these tests.

Writing a stress test

Define the GN build targets

Define rustc_binary, fuchsia_component and fuchsia_test_package GN build targets for your test:

rustc_binary("filesystem-stressor-bin") {
    deps = [
        ...
        "//src/sys/lib/stress-test",
        ...
    ]
    sources = [
        ...
    ]
}

fuchsia_component("filesystem-stressor") {
    deps = [ ":filesystem-stressor-bin" ]
    manifest = "meta/filesystem-stressor.cmx"
    testonly = true
}

fuchsia_test_package("filesystem-stress-tests") {
  test_components = [
    ":filesystem-stressor"
  ]
}

Write an actor

Every actor must implement the Actor trait. The actor trait is one method perform() that is invoked by an ActorRunner. When invoked, the actor must perform exactly one operation and return its result to the runner. An actor must stores all the connections necessary to perform operations.

pub trait Actor: Sync + Send + 'static {
    // ActorRunner invokes this function, instructing the actor
    // to perform exactly one operation and return result.
    async fn perform(&mut self) -> Result<(), ActorError>;
}

An actor can indicate the following with the return result:

Ok(()): Operation succeeded and is added to the global operation count.
Err(ActorError::DoNotCount): The operation must not be counted towards the global operation count.
Err(ActorError::ResetEnvironment): The environment must be reset and the operation must not be counted towards the global operation count.

When an actor encounters an unexpected error, it should panic, thus stopping the test.

Since actors are operating on the same environment, it is possible that their operations will collide. For example, for a filesystem stress test, actors may operate on the same set of files. If such collisions are desirable, you must setup actors to handle such collisions gracefully. If not, the actor should panic, causing the test to stop.

An actor can intentionally break the system-under-test, requiring the environment to be reset. For example, for a filesystem stress test, an actor can randomly sever the connection between the filesystem and the underlying block device. In this example, other actors should request a new environment with ActorError::ResetEnvironment, and the environment will re-establish connections for all of the actors.

Note: The mutable connections of the actor should be marked public, so that the environment can update them during reset.

pub struct FilesystemActor {
    /// Store a connection to the root of filesystem here
    pub root_directory: Directory
    ...
}

impl FilesystemActor {
    pub fn new(root_directory: Directory) -> Self {
        ...
    }
}

#[async_trait]
impl Actor for FilesystemActor {
    async pub fn perform(&mut self) -> Result<(), ActorError> {
        // Choose exactly one operation to do on the filesystem
        // using the root_directory
        self.root_directory.delete_all_files();
    }
}

Write an Environment

The environment provides the basic configuration for the stress test - the exit criteria, the actors and a reset method.

pub trait Environment: Send + Sync + Debug {
    /// Returns the target number of operations to complete before exiting
    fn target_operations(&self) -> Option<u64>;

    /// Returns the number of seconds to wait before exiting
    fn timeout_seconds(&self) -> Option<u64>;

    /// Return the runners for all the actors
    async fn actor_runners(&mut self) -> Vec<ActorRunner>;

    /// Reset the environment, when an actor requests one
    async fn reset(&mut self);
}

An environment can store additional configuration for the test. You can provide this configuration through the command line with the argh crate.

An actor is shared between a runner and the environment and hence it must be wrapped as an Arc<Mutex<dyn Actor>>. Runners hold the lock while an actor is performing an operation. This means that the environment can only acquire an actor's lock between operations.

An environment is instructed to reset when an actor determines that the current instance of the system-under-test has been broken. The environment is expected to create a new instance for the system-under-test and lock on the actors to update their connections to the new instance.

The environment must also implement the Debug trait. Stress tests log the environment when the test starts and if the test panics. It is common practice to print out parameters that are valuable for reproducing the test, such as the random seed used.

#[derive(Debug)]
pub struct FilesystemEnvironment {
    fs_actor: Arc<Mutex<FilesystemActor>>,
    seed: u64,
    ...
}

impl Environment {
    pub fn new() -> Self {
        ...
    }
}

#[async_trait]
impl Environment for FilesystemEnvironment {
    fn target_operations(&self) -> Option<u64> {
        // By specifying None here, the test will run without an operation limit
        None
    }

    fn timeout_seconds(&self) -> Option<u64> {
        // By specifying None here, the test will run without a time limit
        None
    }

    async fn actor_runners(&mut self) -> Vec<ActorRunner> {
        vec![
            ActorRunner::new(
                "filesystem_actor",  // debug name
                60,  // delay (in seconds) between each operation (0 means no delay)
                self.fs_actor.clone()), // actor
            )
        ]
    }

    async fn reset(&mut self) {
        // If the actor is performing an operation, this will remain
        // locked until the operation is complete.
        let actor = self.fs_actor.lock().await;

        // Now the environment can update the actor before it is run again.
        actor.root_directory = ...;

        // Releasing the lock will resume the runner.
    }
}

Write the main function

The main function of a stress test is straightforward, since most of the logic is implemented in the Environment and Actors. Use the main function to collect command-line arguments (if any), initialize logging and set log severity.

#[fuchsia::main]
async fn main() {
    // Create the environment
    let env = FilesystemEnvironment::new();

    // Run the test.
    // Depending on the exit criteria, this may never return.
    stress_test::run_test(env).await;
}

Running stress tests locally

Since a stress test is a part of a fuchsia_test_package, one of the easiest ways to run it is with the fx test command:

fx test filesystem-stress-tests

Note: The stress test runs with the command line arguments defined in the component's manifest.

To run the test with custom command line arguments, use fx shell run:

fx shell run fuchsia-pkg://fuchsia.com/filesystem-stress-tests#meta/filesystem-stressor.cmx <args>

Running stress tests on infrastructure

A stress test is identified by infrastructure through the stress-tests tag that is attached to the fuchsia_test_package or fuchsia_unittest_package GN Build Target.

fuchsia_test_package("filesystem-stress-tests") {
  test_components = [
    ":filesystem-stressor"
  ]
  test_specs = {
    environments = [
      {
        dimensions = {
          device_type = "QEMU"
        }
        tags = [ "stress-tests" ]
      },
    ]
  }
}

A dedicated core.qemu-x64-stress builder identifies these tests and runs each test component in the package for a maximum of 22 hours.

Note: On infra bots, a stress test is required to show “signs of life” which is usually some form of output to show that the test is still running and has not hung.

Note: Stress tests are currently restricted to the QEMU device type, since they run for long periods of time.

Debugging a stress test

The framework uses the rust log crate to log messages. The test logs the environment object at start and if the test panics.

--------------------- stressor is starting -----------------------
Environment {
    seed: 268479717856254664270968796173957499835,
    filesystem_actor: { ... }
    ...
}
------------------------------------------------------------------

If debug logging is enabled, individual actor operations and operation counts are also logged.

DEBUG: [0][filesystem_actor][389] Sleeping for 2 seconds
DEBUG: [0][filesystem_actor][389] Performing...
DEBUG: [0][filesystem_actor][389] Done!
DEBUG: Counters -> [total:403] {"filesystem_actor": 403}