checkasm 1.0.1
Assembly testing and benchmarking framework
Benchmarking

This guide explains how to use checkasm's benchmarking capabilities to accurately measure and compare the performance of optimized assembly implementations against reference code.

Basics

Benchmark Workflow

Benchmarking in checkasm follows the same structure as correctness testing, with an additional call to checkasm_bench_new():

BUF_RECT(uint8_t, src, 64, 64);
BUF_RECT(uint8_t, dst_c, 64, 64);
BUF_RECT(uint8_t, dst_a, 64, 64);

checkasm_declare(void, uint8_t *dst, ptrdiff_t dst_stride,
                 const uint8_t *src, ptrdiff_t src_stride,
                 int w, int h);

if (checkasm_check_func(dsp->filter, "filter_64x64")) {
    // Correctness testing
    checkasm_call_ref(dst_c, dst_c_stride, src, src_stride, 64, 64);
    checkasm_call_new(dst_a, dst_a_stride, src, src_stride, 64, 64);
    checkasm_check_rect_padded(dst_c, dst_c_stride,
                               dst_a, dst_a_stride, 64, 64, "dst");

    // Benchmarking
    checkasm_bench_new(checkasm_alternate(dst_a, dst_c), dst_a_stride,
                       src, src_stride, 64, 64);
}
checkasm_report("filter");

Running Benchmarks

Enable benchmarking with the --bench flag:

# Quick benchmark (uses default duration)
./checkasm --bench
# Longer benchmark for more accurate results (10ms per function)
./checkasm --bench --duration=10000
# Benchmark specific functions only
./checkasm --bench --function='filter_*'
# Verbose output showing all timing measurements
./checkasm --bench --verbose

The --duration parameter controls how long (in microseconds) each function is benchmarked. Longer durations provide more accurate results but take more time. The default is typically sufficient for most cases.

Exporting Results

checkasm can export benchmark results in multiple formats:

# CSV format (suitable for spreadsheets)
./checkasm --bench --csv > results.csv # without column headers
./checkasm --bench --csv --verbose > results.csv # with column headers
# JSON format (for programmatic analysis, includes all data)
./checkasm --bench --json > results.json
# HTML format (interactive visualizations)
./checkasm --bench --html > results.html

The JSON output format includes all measurement data and detailed statistical parameters, including kernel density estimates, regression parameters, and confidence intervals. The HTML output displays this same data in the form of interactive charts.

Statistical Methodology

Log-Normal Distribution Modeling

checkasm models execution time as a log-normal distribution, which is well-suited for performance measurements because:

  1. Execution time is always positive
  2. Performance variations tend to be multiplicative (e.g. power states) rather than additive
  3. Outliers (e.g., from cache misses or interrupts) naturally fall into the long tail

The statistical estimator tracks two parameters:

  • Log mean (μ): the mean of log(execution time), which equals the logarithm of the median execution time
  • Log variance (σ²): the variance of log(execution time)

From these, checkasm computes:

  • Mode: most likely execution time = exp(μ - σ²)
  • Median: middle execution time = exp(μ)
  • Mean: average execution time = exp(μ + σ²/2)
  • Standard deviation: sqrt(exp(2μ + σ²) × (exp(σ²) - 1))
  • Upper/Lower 95% confidence intervals: exp(μ ± 1.96 × σ)
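
As a concrete illustration, these derived quantities can be computed directly from the two tracked parameters. The following is a minimal sketch (not checkasm's internal code), assuming μ and σ² are available as plain doubles:

#include <math.h>

// Minimal sketch (not checkasm's internal code): derive the statistics
// listed above from the two tracked parameters mu (log mean) and
// sigma2 (log variance).
typedef struct {
    double mode, median, mean, stddev, ci_lo, ci_hi;
} lognormal_stats;

static lognormal_stats lognormal_derive(double mu, double sigma2)
{
    lognormal_stats s;
    double sigma = sqrt(sigma2);
    s.mode   = exp(mu - sigma2);
    s.median = exp(mu);
    s.mean   = exp(mu + sigma2 / 2.0);
    s.stddev = sqrt(exp(2.0 * mu + sigma2) * (exp(sigma2) - 1.0));
    s.ci_lo  = exp(mu - 1.96 * sigma);   // lower 95% confidence bound
    s.ci_hi  = exp(mu + 1.96 * sigma);   // upper 95% confidence bound
    return s;
}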

Linear Regression

checkasm performs linear regression in log-space on the relationship between iteration count and total execution time:

log(per_call_time) = log(total_time) - log(iterations)

This approach:

  • Automatically handles the multiplicative nature of timing variations
  • Provides robust outlier rejection through regression residuals
  • Separates per-call time from measurement overhead
  • Computes confidence intervals for the estimates
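
As a rough illustration of the general technique (not checkasm's actual estimator), an ordinary least-squares fit in log-space could look like the sketch below, where the intercept of a fit with slope close to 1 estimates the log of the per-call time:

#include <math.h>
#include <stddef.h>

// Rough sketch of a log-space least-squares fit over measurement pairs of
// (iteration count, total time). Not checkasm's actual estimator: we regress
// log(total_time) against log(iterations); with a slope near 1, the
// intercept estimates log(per-call time).
static void fit_log_space(const double *iters, const double *total_time,
                          size_t n, double *slope, double *intercept)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        double x = log(iters[i]);
        double y = log(total_time[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double denom = n * sxx - sx * sx;
    *slope     = (n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / n;
    // Per-call time estimate: exp(*intercept), in the units of total_time.
}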

Geometric Mean for Multiple Runs

When checkasm_bench_new() is called multiple times for the same function (e.g., testing different block sizes), the final reported value is the geometric mean of all measurements:

geometric_mean = (x₁ × x₂ × ... × xₙ)^(1/n)

The geometric mean is appropriate for performance measurements because it:

  • Is not skewed by outliers as heavily as arithmetic mean
  • Properly handles ratios and speedups across multiple orders of magnitude
  • Provides a representative "typical" performance across configurations
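
For numerical stability, the geometric mean is typically computed via the mean of logarithms rather than by multiplying the values directly; an illustrative sketch (not checkasm's internal code):

#include <math.h>
#include <stddef.h>

// Illustrative sketch (not checkasm's internal code): geometric mean
// computed as exp(mean of logs), which avoids overflow/underflow when
// combining many measurements.
static double geometric_mean(const double *x, size_t n)
{
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(x[i]);
    return exp(log_sum / n);
}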

Overhead Correction

checkasm measures and subtracts the overhead of:

  1. The benchmarking loop itself
  2. The function call mechanism
  3. The timer read operation

This is done by measuring a no-op function and subtracting its measured time from all benchmark results. The no-op overhead is reported at startup:

- No-op overhead: 2.41 +/- 0.093 cycles per call (provisional)

For accuracy, this is re-estimated periodically throughout the benchmarking process to account for any drift. The final value is reported again at the end if --verbose is enabled.
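
Conceptually, the correction amounts to subtracting the no-op estimate from each raw per-call measurement; a minimal sketch (illustrative only, not checkasm's internals):

// Illustrative sketch, not checkasm's internals: subtract the estimated
// no-op overhead from a raw per-call cycle count, clamping at zero so that
// functions cheaper than the measurement overhead don't report negative times.
static double adjust_for_overhead(double raw_cycles, double nop_cycles)
{
    double adjusted = raw_cycles - nop_cycles;
    return adjusted > 0.0 ? adjusted : 0.0;
}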

Best Practices

System State

Despite all of the statistical techniques employed by checkasm to combat short- and medium-term noise, there is an unavoidable dependence on long-term changes in system state. For reliable benchmarking, consider:

Power Management:

  • Disable CPU frequency scaling if possible:
    # Linux: set CPU governor to performance mode
    sudo cpupower frequency-set --governor performance
    # Or for all CPUs
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance | sudo tee $cpu
    done
  • Disable turbo boost for consistent results:
    # Intel
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
    # AMD
    echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

System Load:

  • Close unnecessary applications
  • Avoid running benchmarks on heavily loaded systems
  • Consider using nice to prioritize the benchmark process:
    nice -n -20 ./checkasm --bench

Thermal Throttling:

  • Ensure adequate cooling to prevent thermal throttling
  • Allow sufficient cool-down time between benchmark runs
  • Monitor CPU temperature during long benchmark sessions

Cache Alignment

checkasm automatically warms up caches before taking measurements, but you should be aware of cache effects:

  • L1/L2 Cache: The code and data of most optimized functions should fit in L2 cache
  • Data Working Set: Buffers allocated with BUF_RECT() are properly aligned and sized to avoid cache conflicts; buffers allocated by other means may suffer performance hits from unaligned memory accesses or cache thrashing

For functions with large working sets that don't fit in cache, benchmark results may reflect cache miss behavior, which is often realistic for real-world usage.

Buffer Alternation

Use checkasm_alternate() when benchmarking to prevent cache pollution:

// Good: alternates between dst_a and dst_c to prevent cache hits from
// previous iterations
checkasm_bench_new(checkasm_alternate(dst_a, dst_c), dst_a_stride,
                   src, src_stride, w, h);

// Acceptable: always writes to dst_a
checkasm_bench_new(dst_a, dst_a_stride, src, src_stride, w, h);

Alternating buffers ensures that a measurement is not affected by accesses to the same destination buffer made in the previous loop iteration.

Realistic Test Data

Use realistic input data for benchmarks:

// For general data processing: use INITIALIZE_BUF() which includes common edge cases

// For specific patterns: use domain-appropriate data
checkasm_randomize_normf(audio_buf, len); // Audio: normal distribution

// For worst-case analysis: test pathological inputs
generate_worst_case_pattern(buf, len);

The input data can significantly affect performance due to:

  • Data-dependent branches in the implementation
  • SIMD instruction efficiency varying with data patterns
  • Cache behavior depending on data values

If your function's performance varies significantly with the input data or configuration, consider looping over all such configurations and running checkasm_bench_new() for each, to measure an overall average.
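
For example, a test could cycle through a few representative input patterns and benchmark each one. The sketch below assumes a hypothetical audio function taking (buf, len); the fill_* helpers are illustrative and not part of checkasm:

// Sketch with hypothetical helpers (fill_silence, fill_noise and
// fill_speech_like are illustrative, not part of checkasm): benchmark the
// same function over several input patterns so the reported result averages
// over them.
for (int pattern = 0; pattern < 3; pattern++) {
    switch (pattern) {
    case 0: fill_silence(audio_buf, len);     break;
    case 1: fill_noise(audio_buf, len);       break;
    case 2: fill_speech_like(audio_buf, len); break;
    }
    // Each call contributes to the geometric mean reported for this function.
    checkasm_bench_new(audio_buf, len);
}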

Choosing Configurations

When benchmarking functions that support multiple sizes or configurations:

Option 1: Benchmark all configurations

for (int w = 4; w <= 128; w <<= 1) {
    if (checkasm_check_func(dsp->filter, "filter_w%d", w)) {
        for (int h = 4; h <= 128; h <<= 1) {
            // Test for correctness
            // ...

            // Benchmark each configuration
            checkasm_bench_new(dst, dst_stride, src, src_stride, w, h);
        }
        // Reports geometric mean of all configurations
    }
}

Option 2: Benchmark representative sizes separately

const int sizes[][2] = { {16, 16}, {16, 32}, {64, 16}, {64, 32} };

for (int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
    int w = sizes[i][0], h = sizes[i][1];

    // Separate check_func call = separate benchmark report
    if (checkasm_check_func(dsp->filter, "filter_%dx%d", w, h)) {
        // Test for correctness
        // ...
        checkasm_bench_new(dst, dst_stride, src, src_stride, w, h);
    }
}

Option 3: Benchmark only the limiting case

for (int w = 4; w <= 128; w <<= 1) {
    if (checkasm_check_func(dsp->filter, "filter_w%d", w)) {
        for (int h = 4; h <= 128; h <<= 1) {
            // Test all for correctness
            // ...
        }
        // Benchmark only the largest size
        checkasm_bench_new(dst, dst_stride, src, src_stride, w, 128);
    }
}

Choose based on your needs:

  • Option 1: General performance across all sizes
  • Option 2: Specific performance for important sizes
  • Option 3: Best-case or worst-case performance

Interpreting Results

Understanding Output

checkasm's benchmark output provides several pieces of information. Pass --verbose to see all timing measurements.

Benchmark results:
name           cycles +/- stddev     time (nanoseconds)    (vs ref)
nop:              2.4 +/-   0.0        1.2 ns +/-  0.1
filter_c:        64.9 +/-  75.4       33.8 ns +/- 40.8
filter_sse2:     46.9 +/-  13.3       24.4 ns +/-  7.8     ( 1.22x)
filter_avx2:     20.6 +/-   1.2       10.7 ns +/-  1.4     ( 3.15x)

Columns:

  • name: Function name (with ISA suffix for optimized versions)
  • cycles: Estimated CPU cycles per call (mean ± standard deviation)
  • time: Estimated nanoseconds per call (mean ± standard deviation)
  • (vs ref): Speedup relative to the reference (C) implementation

What to look for:

  • Lower cycle counts indicate better performance
  • Standard deviation shows measurement reliability
  • Speedup factors show optimization effectiveness

High Variance

High standard deviation (large ± values) can indicate:

  1. System noise: Background processes, interrupts, frequency scaling
  2. Data-dependent performance: Function runs faster/slower on different inputs
    • This may be legitimate behavior (e.g., early exit conditions)
    • Consider whether benchmark input is representative
  3. Cache effects: Function doesn't fit in cache or has cache conflicts
    • May be realistic for large working sets
    • Ensure buffers are properly aligned

The benchmark summary reports average timing error:

- average timing error: 17.371% across 15 benchmarks (maximum 65.404%)

High maximum error typically indicates at least one very noisy measurement.

Comparing Implementations

When comparing optimized implementations:

Absolute speedup:

filter_avx2: 20.6 cycles ( 3.15x)

This implementation is 3.15× faster than the C reference.

Relative comparison:

filter_sse2: 46.9 cycles ( 1.22x)
filter_avx2: 20.6 cycles ( 3.15x)

AVX2 is 46.9 / 20.6 = 2.28× faster than SSE2.

Regression Detection

Use benchmark results to detect performance regressions:

  1. Baseline measurements: Save benchmark results for your codebase:
    ./checkasm --bench --json > baseline.json
  2. After changes: Run benchmarks again:
    ./checkasm --bench --json > current.json
  3. Compare: Look for functions that got slower
    • Small variations (< 5%) are typically noise
    • Changes > 10% warrant investigation
    • Changes > 20% are likely real regressions or improvements

Advanced Topics

Microbenchmarking Pitfalls

Be aware of common microbenchmarking issues:

  • Dead Code Elimination: If the optimized function's results aren't used, the compiler might optimize the call away, especially when compiling with link-time optimization. This usually shows up as unrealistically low cycle counts.
  • Constant Folding: Always use INITIALIZE_BUF() or RANDOMIZE_BUF() to ensure inputs aren't compile-time constants that could be folded away.
  • Branch Prediction: Running the same code path repeatedly (as benchmarks do) leads to near-perfect branch prediction, which may not reflect real-world workloads that involve mixed function calls (e.g. varying block sizes). This is generally acceptable since you're comparing implementations under the same conditions, but it may hide performance gains from e.g. branchless implementations.
  • Memory Hierarchy: Benchmarks often measure L1/L2 cache performance, not DRAM performance. For functions with large working sets, real-world performance may be lower than benchmarks suggest, which puts a bound on the realistically achievable speedup from SIMD optimizations.

Platform Considerations

Timer Resolution

checkasm reports timer resolution at startup:

- Timing source: x86 (rdtsc)
- Timing resolution: 0.5976 +/- 0.057 ns/cycle (1644 +/- 156.8 MHz) (provisional)

The timing source depends on the platform:

  • x86/x86_64: rdtsc (cycle counter) - very high resolution
  • ARM/AArch64: pmccntr (cycle counter) - high resolution
  • LoongArch: rdtime (tick counter) - high resolution
  • PowerPC 64le: mfspr (tick counter) - medium resolution
  • Other/Fallback: OS-provided timers - lower resolution

Lower resolution timers may require longer --duration for accurate results.

Frequency Scaling

The timer resolution includes clock frequency estimation. If CPU frequency scaling is enabled, this estimate may be inaccurate. However, this affects only the conversion to nanoseconds, not cycle counts. For most accurate results, disable frequency scaling, or compare only raw cycle counts (for platforms with access to high-resolution cycle counters).

Cross-Platform Comparison

Comparing cycle counts across different CPUs is meaningful when:

  • Both CPUs are from the same architecture family
  • Both run at similar clock speeds
  • You account for microarchitectural differences

For cross-platform comparison, use relative speedup (optimized vs C) rather than absolute cycle counts.

HTML Report Overview

The HTML report provides detailed statistical visualizations:

Kernel Density Estimate (left chart)

  • Shows the probability distribution of execution times
  • Peak indicates most likely execution time (mode)
  • Wider distribution = higher variance
  • Derived from log-normal distribution fit
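
Since the chart is derived from the fitted log-normal parameters, the plotted curve corresponds to the log-normal density; a minimal sketch of evaluating it at execution time t (illustrative only, not the report's actual plotting code):

#include <math.h>

// Illustrative sketch: evaluate the log-normal probability density at
// execution time t, given the fitted log mean mu and log variance sigma2.
static double lognormal_pdf(double t, double mu, double sigma2)
{
    if (t <= 0.0)
        return 0.0;
    const double two_pi = 6.283185307179586;
    double d = log(t) - mu;
    return exp(-d * d / (2.0 * sigma2)) / (t * sqrt(two_pi * sigma2));
}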

Raw Measurements (right chart)

  • X-axis: iteration count (how many times function was called in one measurement)
  • Y-axis: total time for all iterations in one measurement
  • Line: linear regression fit
  • Shaded area: 95% confidence interval
  • Points far from line: potential outliers

Metrics Table

  • Adjusted cycles/time: After overhead subtraction (use this for comparisons)
  • Raw cycles/time: Before overhead subtraction (may be more reliable for sub-10-cycle functions)

Tips and Tricks

Reproducible Benchmarks

For reproducible results:

  1. Use fixed random seed:
    ./checkasm --bench 12345 # Use seed 12345
    This ensures the same test data patterns across runs.
  2. Document system state:
    • CPU model and frequency settings
    • Compiler version and flags
    • Operating system and kernel version
    • checkasm version/commit
  3. Multiple runs: Run benchmarks multiple times and verify consistency:
    for i in {1..5}; do
        ./checkasm --bench --function='filter_*' | tee run_$i.txt
    done