Performance Benchmarking with Criterion

Overview

This document provides a comprehensive guide to performance benchmarking in readstat-rs using Criterion.rs.

Quick Start

# Run all benchmarks
cd crates/readstat
cargo bench

# View HTML reports
open target/criterion/report/index.html

What Gets Benchmarked

1. Reading Performance

  • Metadata Reading (~300-950 µs) - File header parsing
  • Single Chunk Reading - Full dataset read performance
  • Chunked Reading - Streaming with different chunk sizes (1K, 5K, 10K rows)

2. Data Conversion

  • Arrow Conversion - SAS types → Arrow RecordBatch overhead

3. Writing Performance

  • CSV Writing - Text format output
  • Parquet Compression - Uncompressed, Snappy, Zstd comparison
  • Format Comparison - CSV vs Parquet vs Feather vs NDJSON

4. Parallel Write Optimization

  • Buffer Sizes - SpooledTempFile memory thresholds (1MB, 10MB, 100MB, 500MB)

5. End-to-End Pipeline

  • Complete Conversion - Read + Write combined (most important)
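
All of the above are ordinary Criterion.rs benchmark functions. As a point of reference, here is a minimal sketch of how such a benchmark is registered (the read_metadata helper is a hypothetical stand-in, not the crate’s actual API):

use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stand-in; the real benchmarks call into the crate's reader API.
fn read_metadata(path: &str) -> usize {
    path.len() // placeholder so the sketch is self-contained
}

fn bench_metadata(c: &mut Criterion) {
    c.bench_function("metadata_reading/cars.sas7bdat", |b| {
        // b.iter runs the closure many times and records each duration
        b.iter(|| read_metadata("tests/data/cars.sas7bdat"))
    });
}

criterion_group!(benches, bench_metadata);
criterion_main!(benches);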

Sample Results

From an initial benchmark run (example output):

metadata_reading/all_types.sas7bdat
                        time:   [299.41 µs 301.84 µs 304.29 µs]

metadata_reading/cars.sas7bdat
                        time:   [935.21 µs 943.52 µs 952.41 µs]

read_single_chunk/cars.sas7bdat
                        time:   [~2-3 ms]
                        thrpt:  [~150-200K rows/sec]

write_parquet_compression/snappy
                        time:   [~4-6 ms]
                        thrpt:  [~70-100K rows/sec]

end_to_end_conversion/parquet
                        time:   [~6-9 ms]
                        thrpt:  [~50-70K rows/sec]

Interpreting Results

Understanding the Output

Time Measurement:

time: [299.41 µs 301.84 µs 304.29 µs]
       ^         ^         ^
       |         |         +-- Upper bound (95% confidence)
       |         +------------ Point estimate
       +---------------------- Lower bound (95% confidence)

Throughput:

thrpt: [150K elem/s 175K elem/s 200K elem/s]
        ^           ^           ^
        |           |           +-- Upper bound
        |           +-------------- Point estimate
        +-------------------------- Lower bound
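
A thrpt: line only appears when a benchmark declares how many elements one iteration processes. A minimal sketch of how that is wired up in Criterion.rs (the read_all_rows helper is a hypothetical stand-in; 1081 is the cars.sas7bdat row count):

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn read_all_rows() -> usize {
    1081 // stand-in for reading every row of cars.sas7bdat
}

fn bench_read(c: &mut Criterion) {
    let mut group = c.benchmark_group("read_single_chunk");
    // Declaring the per-iteration element count is what makes Criterion
    // report throughput in elem/s alongside the raw time.
    group.throughput(Throughput::Elements(1081));
    group.bench_function("cars.sas7bdat", |b| b.iter(read_all_rows));
    group.finish();
}

criterion_group!(benches, bench_read);
criterion_main!(benches);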

Change Detection:

change: [-2.3456% -1.2345% +0.1234%] (p = 0.12 > 0.05)
         ^         ^         ^        ^
         |         |         |        +-- Statistical significance
         |         |         +----------- Upper bound of change
         |         +--------------------- Point estimate of change
         +------------------------------- Lower bound of change

What to Look For

🔴 Red Flags (Investigate)

  • High variance (>10%) - Results unreliable
  • Significant regression (>5% slower, p < 0.05)
  • Outliers (>5% of samples)

🟡 Opportunities

  • Chunked reading - Test if different chunk size improves throughput
  • Buffer sizes - If small buffer performs as well as large, save memory
  • Compression - If uncompressed only slightly faster, use compression

🟢 Validation

  • Low variance (<5%) - Reliable results
  • Improvements (>10% faster, p < 0.05)
  • Expected patterns (e.g., compression should be slower but smaller)

Performance Optimization Workflow

Step 1: Establish Baseline

# Save current performance as baseline
cargo bench --save-baseline main

# Results saved to target/criterion/{benchmark}/main/

Step 2: Make Changes

Edit code with optimization hypothesis:

  • Increase buffer size
  • Change algorithm
  • Add caching
  • Parallel processing

Step 3: Measure Impact

# Compare against baseline
cargo bench --baseline main

# Look for "change: [X% Y% Z%]" in output

Step 4: Analyze & Iterate

If improved (>10% faster, p < 0.05):

  • ✅ Keep the change
  • ✅ Update the baseline: cargo bench --save-baseline main

If no change (<5%, not statistically significant):

  • ⚠️ The optimization didn’t help - profile to find the real bottleneck

If regressed (slower, p < 0.05):

  • ❌ Revert the change
  • ❌ Investigate why performance decreased

Common Optimization Scenarios

Scenario 1: Slow Reading

Symptoms: read_single_chunk time is high

Investigate:

  1. ReadStat C library overhead (FFI calls)
  2. Memory allocation patterns
  3. Callback overhead

Try:

  • Larger buffers in C library
  • Memory-mapped files (see evaluation doc)
  • Pre-allocate column vectors
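
As a sketch of the last idea, pre-allocating column buffers avoids repeated reallocation as rows stream in (the row and variable counts are illustrative; f64 stands in for the real column types):

// Pre-allocate one buffer per column once the row count is known from
// the file metadata, instead of growing each Vec row by row.
fn allocate_columns(row_count: usize, var_count: usize) -> Vec<Vec<f64>> {
    (0..var_count)
        .map(|_| Vec::with_capacity(row_count))
        .collect()
}

fn main() {
    let columns = allocate_columns(1_000_000, 13);
    // Each push during reading now reuses the reserved capacity.
    assert!(columns.iter().all(|c| c.capacity() >= 1_000_000));
}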

Scenario 2: Slow Writing

Symptoms: write_formats time is high

Investigate:

  1. BufWriter buffer size
  2. Format-specific overhead
  3. Compression CPU usage

Try:

  • Increase BufWriter capacity (currently 8KB; see the sketch after this list)
  • Use faster compression (Snappy vs Zstd)
  • Parallel writing (already implemented)
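
A sketch of the first idea: BufWriter::with_capacity raises the buffer above the 8 KB default so fewer write syscalls reach the disk (the 1 MiB figure and file name are examples to benchmark, not recommendations):

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let file = File::create("cars_rust.csv")?;
    // 1 MiB buffer instead of the 8 KiB default; benchmark before adopting.
    let mut writer = BufWriter::with_capacity(1 << 20, file);
    for i in 0..10_000 {
        writeln!(writer, "row,{i}")?;
    }
    writer.flush()?; // push any remaining buffered bytes to disk
    Ok(())
}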

Scenario 3: Memory Issues

Symptoms: System swapping, OOM errors

Investigate:

  1. Chunk size too large
  2. Too many parallel streams
  3. Memory leaks

Try:

  • Reduce stream_rows (default 10,000)
  • Reduce parallel write buffer (default 100MB)
  • Use bounded channels (already implemented)
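
The last point refers to bounded channels, which apply backpressure so the reader cannot race ahead of the writer. A minimal sketch using std::sync::mpsc::sync_channel (the capacity of 4 chunks and the chunk type are illustrative):

use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // At most 4 chunks may be in flight; send() blocks the reader when
    // the writer falls behind, which caps peak memory use.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);

    let reader = thread::spawn(move || {
        for _ in 0..16 {
            tx.send(vec![0u8; 1024]).unwrap(); // blocks when the queue is full
        }
    });

    for chunk in rx {
        let _ = chunk.len(); // the writer drains chunks at its own pace
    }
    reader.join().unwrap();
}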

Scenario 4: High Variance

Symptoms: Large confidence intervals, many outliers

Investigate:

  1. System background activity
  2. CPU frequency scaling
  3. Thermal throttling

Try:

  • Close background apps
  • Disable frequency scaling
  • Run on consistent power mode

Advanced Profiling

CPU Profiling with Flamegraphs

# Install flamegraph
cargo install flamegraph

# Profile a specific benchmark
cargo flamegraph --bench readstat_benchmarks -- --bench read_single_chunk

# Open flamegraph.svg to see hotspots

What to look for:

  • Wide bars = lots of time spent
  • Deep stacks = call overhead
  • Unexpected functions = bugs/inefficiency

Memory Profiling

# Using valgrind (Linux)
# Build the bench binary first, then run it under massif directly
# (wrapping cargo itself would profile cargo, not the benchmark)
cargo bench --no-run
valgrind --tool=massif \
  target/release/deps/readstat_benchmarks-* --bench read_single_chunk
ms_print massif.out.* > memory_profile.txt

# Using heaptrack (Linux)
heaptrack target/release/deps/readstat_benchmarks-* --bench read_single_chunk
heaptrack_gui heaptrack.*.gz

System Call Tracing

# Linux: strace (trace the bench binary directly, not cargo)
strace -c target/release/deps/readstat_benchmarks-* --bench read_single_chunk 2>&1 | tail -20

# macOS: dtruss
sudo dtruss -c target/release/deps/readstat_benchmarks-* --bench read_single_chunk

Comparing Implementations

Before/After Memory-Mapped Files

# Baseline without mmap
git checkout main
cargo bench --save-baseline without-mmap

# With mmap implementation
git checkout feature/mmap
cargo bench --baseline without-mmap

# Look for improvements in read_single_chunk

Parallel vs Sequential

# Compare parallel and sequential variants by filtering on benchmark
# names (assumes the benchmarks are registered under these names)
cargo bench -- end_to_end_parallel
cargo bench -- end_to_end_sequential

CI/CD Integration

Performance Regression Detection

Add to .github/workflows/benchmarks.yml:

name: Performance Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        run: |
          cd crates/readstat
          cargo bench --no-run  # Just compile for CI

      - name: Compare with baseline (on main branch)
        if: github.event_name == 'pull_request'
        run: |
          cd crates/readstat
          git fetch origin main:main
          git checkout main
          cargo bench --save-baseline main
          git checkout -
          cargo bench --baseline main

Best Practices

Do’s ✅

  • Run benchmarks on consistent hardware
  • Close background applications
  • Use --save-baseline for comparisons
  • Profile after benchmarking to find bottlenecks
  • Document performance changes in PRs
  • Test on representative data sizes

Don’ts ❌

  • Don’t benchmark on a laptop (thermal throttling)
  • Don’t optimize without profiling first
  • Don’t trust results with high variance
  • Don’t compare across different systems
  • Don’t commit benchmark artifacts
  • Don’t skip statistical significance checks

Performance Goals

Current Performance (Baseline)

  • Metadata reading: ~300-950 µs
  • Read throughput: ~150-200K rows/sec
  • Write throughput: ~70-100K rows/sec
  • End-to-end: ~50-70K rows/sec

Target Performance (Goals)

  • Metadata reading: <500 µs (↓30%)
  • Read throughput: >250K rows/sec (↑25%)
  • Write throughput: >100K rows/sec (↑30%)
  • End-to-end: >100K rows/sec (↑40%)

Stretch Goals

  • Memory-mapped reads: 2x faster for large files
  • Parallel writes: 3-4x speedup with 4+ cores
  • Compression: <10% overhead for Snappy

Data Files for Benchmarking

Current Test Data

  • all_types.sas7bdat - 3 rows, 10 vars (tiny)
  • cars.sas7bdat - 1081 rows, 13 vars (small)

For comprehensive benchmarking, consider adding:

Small (good for quick iteration):

  • < 1 MB file size
  • < 1,000 rows
  • 5-10 variables

Medium (typical use case):

  • 10-100 MB file size
  • 10,000-100,000 rows
  • 10-50 variables

Large (stress test):

  • > 1 GB file size
  • > 1,000,000 rows
  • 50+ variables
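
Once such files exist, they can all be swept in one parameterized Criterion group; a sketch (the file names and row_count helper are hypothetical):

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn row_count(path: &str) -> usize {
    path.len() // stand-in for reading the real file
}

fn bench_by_size(c: &mut Criterion) {
    // Hypothetical file names; swap in whatever test data gets added.
    let files = ["small.sas7bdat", "medium.sas7bdat", "large.sas7bdat"];
    let mut group = c.benchmark_group("read_by_size");
    for file in files {
        // One labelled measurement per input file
        group.bench_with_input(BenchmarkId::from_parameter(file), &file, |b, f| {
            b.iter(|| row_count(f))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_by_size);
criterion_main!(benches);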

Next Steps

  1. Run full benchmark suite: cargo bench
  2. Review HTML reports: Open target/criterion/report/index.html
  3. Identify bottlenecks: Look for slowest operations
  4. Profile with flamegraph: Focus on hotspots
  5. Implement optimizations: Test one at a time
  6. Validate improvements: Compare against baseline
  7. Document findings: Update this file with results

Questions?

  • See detailed README: crates/readstat/benches/README.md
  • Check Criterion docs: https://bheisler.github.io/criterion.rs/book/
  • Review performance evaluation: Memory-mapped files analysis (separate doc)

Benchmarking with hyperfine

Benchmarking performed with hyperfine.

This example compares the performance of the Rust binary with that of the C binary built from the ReadStat repository. In general, the hope is that the Rust binary’s performance is fairly close to that of the C binary.

To run, execute the following from within the readstat directory.

# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat crates\readstat-tests\tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"

📝 First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the --prepare option provided by hyperfine on Windows.

# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data crates/readstat-tests/tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"

Further benchmarking may be performed in the future now that channels and threads have been implemented.

Profiling with Flamegraphs

Profiling performed with cargo flamegraph.

To run, execute the following from within the readstat directory.

cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv

Flamegraph is written to readstat/flamegraph.svg.

📝 Flamegraphs have yet to be used to drive performance improvements.