@page benchmarking Benchmarking
This guide explains how to use checkasm's benchmarking capabilities to accurately measure and compare the performance of optimized assembly implementations against reference code.
@tableofcontents
@section bench_basic Basics
@subsection bench_workflow Benchmark Workflow
Benchmarking in checkasm follows the same structure as correctness testing, with an additional call to checkasm_bench_new():
@code{.c}
BUF_RECT(uint8_t, src,   64, 64);
BUF_RECT(uint8_t, dst_c, 64, 64);
BUF_RECT(uint8_t, dst_a, 64, 64);

checkasm_declare(void, uint8_t *dst, ptrdiff_t dst_stride,
                 const uint8_t *src, ptrdiff_t src_stride, int w, int h);

if (checkasm_check_func(dsp->filter, "filter_64x64")) {
    INITIALIZE_BUF_RECT(src);
    CLEAR_BUF_RECT(dst_c);
    CLEAR_BUF_RECT(dst_a);

    // Correctness testing
    checkasm_call_ref(dst_c, dst_c_stride, src, src_stride, 64, 64);
    checkasm_call_new(dst_a, dst_a_stride, src, src_stride, 64, 64);
    checkasm_check_rect_padded(dst_c, dst_c_stride,
                               dst_a, dst_a_stride, 64, 64, "dst");

    // Benchmarking
    checkasm_bench_new(checkasm_alternate(dst_a, dst_c), dst_a_stride,
                       src, src_stride, 64, 64);
}

checkasm_report("filter");
@endcode
@subsection bench_cli Running Benchmarks
Enable benchmarking with the --bench flag:
@code{.bash}
./checkasm --bench
./checkasm --bench --duration=10000
./checkasm --bench --function='filter_*'
./checkasm --bench --verbose
@endcode
The --duration parameter controls how long (in microseconds) each function is benchmarked. Longer durations give more accurate results but take more time. The default is sufficient for most cases.
@subsection bench_export Exporting Results
checkasm can export benchmark results in multiple formats:
@code{.bash}
./checkasm --bench --csv > results.csv            # without column headers
./checkasm --bench --csv --verbose > results.csv  # with column headers
./checkasm --bench --json > results.json
./checkasm --bench --html > results.html
@endcode
The JSON output format includes all measurement data and detailed statistical parameters, including kernel density estimates, regression parameters, and confidence intervals. The HTML output displays this same data in the form of interactive charts.
@section bench_methodology Statistical Methodology
@subsection bench_lognormal Log-Normal Distribution Modeling
checkasm models execution time as a log-normal distribution, which is well-suited for performance measurements because:
The statistical estimator tracks two parameters:
From these, checkasm computes:
@subsection bench_regression Linear Regression
checkasm performs linear regression in log-space on the relationship between iteration count and total execution time:
@code{.plaintext}
log(per_call_time) = log(total_time) - log(iterations)
@endcode
This approach:
@subsection bench_geometric Geometric Mean for Multiple Runs
When checkasm_bench_new() is called multiple times for the same function (e.g., testing different block sizes), the final reported value is the geometric mean of all measurements:
@code{.plaintext}
geometric_mean = (x₁ × x₂ × ... × xₙ)^(1/n)
@endcode
The geometric mean is appropriate for performance measurements because it:
@subsection bench_overhead Overhead Correction
checkasm measures and subtracts the overhead of:
This is done by measuring a no-op function and subtracting its measured time from all benchmark results. The no-op overhead is reported at startup.
For accuracy, this is re-estimated periodically throughout the benchmarking process to account for any drift. The final value is reported again at the end if --verbose is enabled.
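Conceptually, the correction amounts to a subtraction, as in the sketch below. The function name is hypothetical, and clamping the result at zero is an assumption of this sketch rather than documented checkasm behavior.

```c
#include <assert.h>

/* Subtract the measured no-op overhead from a raw measurement.
 * Clamping at zero is an assumption of this sketch, made so that noise
 * can never yield a negative duration. */
double subtract_overhead(double raw_cycles, double nop_cycles)
{
    assert(raw_cycles >= 0.0 && nop_cycles >= 0.0);
    const double corrected = raw_cycles - nop_cycles;
    return corrected > 0.0 ? corrected : 0.0;
}
```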
@section bench_best_practices Best Practices
@subsection bp_system_state System State
Despite all of the statistical techniques employed by checkasm to combat short- and medium-term noise, there is an unavoidable dependence on long-term changes in system state. For reliable benchmarking, consider:
Power Management:
Disable CPU frequency scaling if possible:
@code{.bash}
# Set the performance governor with cpupower
sudo cpupower frequency-set --governor performance
# Or write the governor directly through sysfs
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu"
done
@endcode
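Disable turbo boost for consistent results:
@code{.bash}
# Intel CPUs using the intel_pstate driver
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Drivers that expose the generic boost control (e.g. acpi-cpufreq)
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
@endcode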
System Load:
Use nice to prioritize the benchmark process (negative niceness values require root privileges):
@code{.bash}
sudo nice -n -20 ./checkasm --bench
@endcode
Thermal Throttling:
@subsection bp_alignment Cache Alignment
checkasm automatically warms up caches before taking measurements, but you should be aware of cache effects:
For functions with large working sets that don't fit in cache, benchmark results may reflect cache miss behavior, which is often realistic for real-world usage.
@subsection bp_alternating Buffer Alternation
Use checkasm_alternate() when benchmarking to prevent cache pollution:
@code{.c}
// Good: alternates between dst_a and dst_c to prevent cache hits from
// previous iterations
checkasm_bench_new(checkasm_alternate(dst_a, dst_c), dst_a_stride,
                   src, src_stride, w, h);

// Acceptable: always writes to dst_a
checkasm_bench_new(dst_a, dst_a_stride, src, src_stride, w, h);
@endcode
Alternating buffers ensures that benchmarks are not stalled by previous access to the same data buffer from the prior loop iteration.
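The idea behind alternation can be pictured with the following sketch. It is purely illustrative: `pick_alternate` is a hypothetical helper, and this is not how checkasm_alternate() is implemented. It only shows the principle of selecting a different destination on each iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pick one of two buffers based on the iteration
 * parity, so consecutive iterations never write the same destination. */
uint8_t *pick_alternate(uint8_t *a, uint8_t *b, unsigned iteration)
{
    return (iteration & 1) ? b : a;
}
```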
@subsection bp_realistic Realistic Test Data
Use realistic input data for benchmarks:
@code{.c}
// For general data processing: use INITIALIZE_BUF_RECT(), which includes
// common edge cases
INITIALIZE_BUF_RECT(src);

// For specific patterns: use domain-appropriate data
checkasm_randomize_normf(audio_buf, len); // Audio: normal distribution

// For worst-case analysis: test pathological inputs
generate_worst_case_pattern(buf, len);
@endcode
The input data can significantly affect performance due to:
If your function's performance varies significantly with the input data or configuration, consider looping over all such configurations and calling checkasm_bench_new() for each, so that the reported result is the geometric mean across all of them.
@subsection bp_configurations Choosing Configurations
When benchmarking functions that support multiple sizes or configurations:
Option 1: Benchmark all configurations
@code{.c}
for (int w = 4; w <= 128; w <<= 1) {
    if (checkasm_check_func(dsp->filter, "filter_w%d", w)) {
        for (int h = 4; h <= 128; h <<= 1) {
            // Test for correctness
            // ...

            // Benchmark each configuration
            checkasm_bench_new(dst, dst_stride, src, src_stride, w, h);
        }
        // Reports geometric mean of all configurations
    }
}
@endcode
Option 2: Benchmark representative sizes separately
@code{.c}
const int sizes[][2] = { {16, 16}, {16, 32}, {64, 16}, {64, 32} };

// size_t index avoids a signed/unsigned comparison against sizeof
for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
    const int w = sizes[i][0], h = sizes[i][1];

    // Separate check_func call = separate benchmark report
    if (checkasm_check_func(dsp->filter, "filter_%dx%d", w, h)) {
        // Test for correctness
        // ...

        checkasm_bench_new(dst, dst_stride, src, src_stride, w, h);
    }
}
@endcode
Option 3: Benchmark only the limiting case
@code{.c}
for (int w = 4; w <= 128; w <<= 1) {
    if (checkasm_check_func(dsp->filter, "filter_w%d", w)) {
        for (int h = 4; h <= 128; h <<= 1) {
            // Test all for correctness
            // ...
        }

        // Benchmark only the largest size
        checkasm_bench_new(dst, dst_stride, src, src_stride, w, 128);
    }
}
@endcode
Choose based on your needs:
@section bench_interpreting Interpreting Results
@subsection interp_output Understanding Output
checkasm's benchmark output provides several pieces of information. Pass --verbose to see all timing measurements.
@code{.plaintext}
Benchmark results:
  name             cycles +/- stddev    time (nanoseconds)   (vs ref)
  nop:                2.4 +/- 0.0         1.2 ns +/- 0.1
  filter_c:          64.9 +/- 75.4       33.8 ns +/- 40.8
  filter_sse2:       46.9 +/- 13.3       24.4 ns +/- 7.8     ( 1.22x)
  filter_avx2:       20.6 +/- 1.2        10.7 ns +/- 1.4     ( 3.15x)
@endcode
Columns:
What to look for:
@subsection interp_variance High Variance
High standard deviation (large ± values) can indicate:
System noise: Background processes, interrupts, frequency scaling
Data-dependent performance: Function runs faster/slower on different inputs
Cache effects: Function doesn't fit in cache or has cache conflicts
The benchmark summary reports the average timing error. A high maximum error in that summary typically indicates at least one very noisy measurement.
@subsection interp_comparison Comparing Implementations
When comparing optimized implementations:
Absolute speedup:
@code{.plaintext}
filter_avx2: 20.6 cycles ( 3.15x)
@endcode
This implementation is 3.15× faster than the C reference.

Relative comparison:
@code{.plaintext}
filter_sse2: 46.9 cycles ( 1.22x)
filter_avx2: 20.6 cycles ( 3.15x)
@endcode
AVX2 is 46.9 / 20.6 = 2.28× faster than SSE2.
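The relative comparison is a plain ratio of timings, which the following sketch spells out. The helper name `speedup` is hypothetical; it takes any two timings in the same unit.

```c
#include <assert.h>

/* Speedup of a candidate over a baseline: how many times faster it runs.
 * Both arguments are timings in the same unit (e.g. cycles). */
double speedup(double baseline_cycles, double candidate_cycles)
{
    assert(baseline_cycles > 0.0 && candidate_cycles > 0.0);
    return baseline_cycles / candidate_cycles;
}
```

With the cycle counts shown above, `speedup(46.9, 20.6)` evaluates to roughly 2.28.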
@subsection interp_regression Regression Detection
Use benchmark results to detect performance regressions:
Baseline measurements: Save benchmark results for your codebase:
@code{.bash}
./checkasm --bench --json > baseline.json
@endcode
After changes: Run benchmarks again:
@code{.bash}
./checkasm --bench --json > current.json
@endcode
Compare: Look for functions that got slower
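One possible way to automate the comparison step is a simple threshold check, sketched below. This assumes you have already extracted the per-function timings from both result files; the function name and the `tolerance` parameter are hypothetical, and the tolerance exists only to absorb run-to-run noise.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag a regression when the current timing exceeds the baseline by more
 * than the given relative tolerance (e.g. 0.05 for 5%). The tolerance is
 * a hypothetical knob to absorb run-to-run noise. */
bool is_regression(double baseline, double current, double tolerance)
{
    assert(baseline > 0.0 && current > 0.0 && tolerance >= 0.0);
    return current > baseline * (1.0 + tolerance);
}
```

A tolerance that is smaller than the reported standard deviation of a measurement will likely produce false alarms, so it may be worth choosing it per function.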
@section bench_advanced Advanced Topics
@subsection adv_microbench Microbenchmarking Pitfalls
Be aware of common microbenchmarking issues:
@subsection adv_platform Platform Considerations
@subsubsection adv_timer Timer Resolution
checkasm reports timer resolution at startup:
@code{.plaintext}
Timing source: x86 (rdtsc)
Timing resolution: 0.5976 +/- 0.057 ns/cycle (1644 +/- 156.8 MHz) (provisional)
@endcode
x86/x86_64: rdtsc (cycle counter) - very high resolution
ARM/AArch64: pmccntr (cycle counter) - high resolution
LoongArch: rdtime (tick counter) - high resolution
PowerPC 64le: mfspr (tick counter) - medium resolution
Other/Fallback: OS-provided timers - lower resolution
Lower-resolution timers may require a longer --duration for accurate results.
@subsubsection adv_freq_scaling Frequency Scaling
The timer resolution includes an estimate of the clock frequency. If CPU frequency scaling is enabled, this estimate may be inaccurate. However, this affects only the conversion to nanoseconds, not cycle counts. For the most accurate results, disable frequency scaling, or compare only raw cycle counts (on platforms with access to high-resolution cycle counters).
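The sketch below shows why an inaccurate frequency estimate only affects the nanosecond figures: the cycle count is the measured quantity, and the time is derived from it by a multiplication. Both helper names are hypothetical. These are plain point conversions; the figures checkasm prints come from its statistical estimator and may not match a simple reciprocal exactly.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds given the estimated resolution. */
double cycles_to_ns(double cycles, double ns_per_cycle)
{
    assert(ns_per_cycle > 0.0);
    return cycles * ns_per_cycle;
}

/* Clock frequency in MHz implied by a resolution in ns per cycle. */
double ns_per_cycle_to_mhz(double ns_per_cycle)
{
    assert(ns_per_cycle > 0.0);
    return 1000.0 / ns_per_cycle;
}
```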
@subsubsection adv_cross_platform Cross-Platform Comparison
Comparing cycle counts across different CPUs is meaningful when:
For cross-platform comparison, use relative speedup (optimized vs C) rather than absolute cycle counts.
@subsection adv_html HTML Report Overview
The HTML report provides detailed statistical visualizations:
@subsubsection adv_kde_regression Kernel Density Estimate (left chart)
@subsubsection adv_raw_measurements Raw Measurements (right chart)
@subsubsection adv_metrics Metrics Table
@section bench_tips Tips and Tricks
@subsection tips_reproducible Reproducible Benchmarks
For reproducible results:
Use fixed random seed:
@code{.bash}
./checkasm --bench 12345 # Use seed 12345
@endcode
This ensures the same test data patterns across runs.
Document system state:
Multiple runs: Run benchmarks multiple times and verify consistency:
@code{.bash}
for i in {1..5}; do
    ./checkasm --bench --function='filter_*' | tee run_$i.txt
done
@endcode
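One simple way to quantify consistency across such runs is the relative spread between the fastest and slowest result for the same function. The sketch below assumes the per-run timings have been collected into an array by hand; `relative_spread` is a hypothetical helper, not a checkasm feature.

```c
#include <assert.h>
#include <stddef.h>

/* Relative spread (max - min) / min of the same benchmark across runs.
 * A small value suggests the runs agree with each other. */
double relative_spread(const double *runs, size_t n)
{
    assert(n > 0);
    double lo = runs[0], hi = runs[0];
    for (size_t i = 1; i < n; i++) {
        if (runs[i] < lo) lo = runs[i];
        if (runs[i] > hi) hi = runs[i];
    }
    assert(lo > 0.0); /* timings must be positive */
    return (hi - lo) / lo;
}
```

For instance, runs of 20.0, 21.0 and 20.5 cycles have a relative spread of about 5%.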