# Performance Benchmarking with Criterion

## Overview

This document provides a comprehensive guide to performance benchmarking in readstat-rs using Criterion.rs.
## Quick Start

```sh
# Run all benchmarks
cd crates/readstat
cargo bench

# View HTML reports
open target/criterion/report/index.html
```
## What Gets Benchmarked

### 1. Reading Performance

- Metadata Reading (~300-950 µs) - file header parsing
- Single Chunk Reading - full dataset read performance
- Chunked Reading - streaming with different chunk sizes (1K, 5K, 10K rows; sketched below)

### 2. Data Conversion

- Arrow Conversion - SAS types → Arrow RecordBatch overhead

### 3. Writing Performance

- CSV Writing - text format output
- Parquet Compression - uncompressed, Snappy, Zstd comparison
- Format Comparison - CSV vs Parquet vs Feather vs NDJSON

### 4. Parallel Write Optimization

- Buffer Sizes - SpooledTempFile memory thresholds (1MB, 10MB, 100MB, 500MB)

### 5. End-to-End Pipeline

- Complete Conversion - read + write combined (most important)
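For orientation, the sketch below shows roughly how a group like the chunked-reading benchmark could be wired up with Criterion's `benchmark_group` and `Throughput` APIs. It is illustrative only: `read_in_chunks` is a hypothetical stand-in for the crate's actual reading code, and the group name need not match the real suite.

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Hypothetical stand-in for the crate's real chunked reader.
fn read_in_chunks(path: &str, chunk_size: usize) -> usize {
    let _ = (path, chunk_size);
    1081 // pretend we read all of cars.sas7bdat
}

fn bench_chunked_reading(c: &mut Criterion) {
    let mut group = c.benchmark_group("read_chunked");
    // cars.sas7bdat has 1081 rows; Throughput::Elements makes Criterion
    // report rows/sec alongside wall time.
    group.throughput(Throughput::Elements(1081));
    for chunk_size in [1_000usize, 5_000, 10_000] {
        group.bench_with_input(
            BenchmarkId::from_parameter(chunk_size),
            &chunk_size,
            |b, &size| b.iter(|| read_in_chunks("tests/data/cars.sas7bdat", size)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_chunked_reading);
criterion_main!(benches);
```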
## Sample Results

From an initial benchmark run (example output):

```text
metadata_reading/all_types.sas7bdat
                        time:   [299.41 µs 301.84 µs 304.29 µs]

metadata_reading/cars.sas7bdat
                        time:   [935.21 µs 943.52 µs 952.41 µs]

read_single_chunk/cars.sas7bdat
                        time:   [~2-3 ms]
                        thrpt:  [~150-200K rows/sec]

write_parquet_compression/snappy
                        time:   [~4-6 ms]
                        thrpt:  [~70-100K rows/sec]

end_to_end_conversion/parquet
                        time:   [~6-9 ms]
                        thrpt:  [~50-70K rows/sec]
```
## Interpreting Results

### Understanding the Output

Time Measurement:

```text
time:   [299.41 µs 301.84 µs 304.29 µs]
         ^         ^         ^
         |         |         +-- Upper bound (95% confidence)
         |         +------------ Median
         +---------------------- Lower bound (95% confidence)
```

Throughput:

```text
thrpt:  [150K elem/s 175K elem/s 200K elem/s]
         ^           ^           ^
         |           |           +-- Upper bound
         |           +-------------- Median
         +-------------------------- Lower bound
```

Change Detection:

```text
change: [-2.3456% -1.2345% +0.1234%] (p = 0.12 > 0.05)
         ^        ^        ^          ^
         |        |        |          +-- Statistical significance
         |        |        +------------- Upper bound of change
         |        +---------------------- Median change
         +------------------------------- Lower bound of change
```
### What to Look For

🔴 Red Flags (Investigate)

- High variance (>10%) - results are unreliable (see the tuning sketch after this list)
- Significant regression (>5% slower, p < 0.05)
- Outliers (>5% of samples)

🟡 Opportunities

- Chunked reading - test whether a different chunk size improves throughput
- Buffer sizes - if a small buffer performs as well as a large one, save the memory
- Compression - if uncompressed is only slightly faster, use compression

🟢 Validation

- Low variance (<5%) - reliable results
- Improvements (>10% faster, p < 0.05)
- Expected patterns (e.g., compression should be slower but smaller)
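When high variance shows up, Criterion itself can be tuned before blaming the code. The sketch below bumps the sample count and measurement window and raises the noise threshold; the specific values are illustrative assumptions, not this project's settings.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
use std::time::Duration;

// Tighter statistics for noisy environments: more samples, longer
// warm-up/measurement windows, and a 5% noise threshold so tiny
// fluctuations are not flagged as changes.
fn configured() -> Criterion {
    Criterion::default()
        .sample_size(200)                          // default: 100
        .warm_up_time(Duration::from_secs(5))      // default: 3 s
        .measurement_time(Duration::from_secs(10)) // default: 5 s
        .noise_threshold(0.05)                     // default: 0.01
}

fn bench_example(c: &mut Criterion) {
    c.bench_function("example", |b| b.iter(|| black_box(21) * 2));
}

criterion_group! {
    name = benches;
    config = configured();
    targets = bench_example
}
criterion_main!(benches);
```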
## Performance Optimization Workflow

### Step 1: Establish Baseline

```sh
# Save current performance as the baseline
cargo bench -- --save-baseline main

# Results are saved to target/criterion/{benchmark}/main/
```

### Step 2: Make Changes

Edit code with an optimization hypothesis:

- Increase a buffer size
- Change an algorithm
- Add caching
- Parallelize processing

### Step 3: Measure Impact

```sh
# Compare against the baseline
cargo bench -- --baseline main

# Look for "change: [X% Y% Z%]" in the output
```

### Step 4: Analyze & Iterate

If improved (>10%, p < 0.05):

- ✅ Keep the change
- ✅ Update the baseline: `cargo bench -- --save-baseline main`

If no change (<5%):

- ⚠️ The optimization didn't help - profile to find the real bottleneck

If regressed (slower):

- ❌ Revert the change
- ❌ Investigate why performance decreased
## Common Optimization Scenarios

### Scenario 1: Slow Reading

Symptoms: `read_single_chunk` time is high

Investigate:

- ReadStat C library overhead (FFI calls)
- Memory allocation patterns
- Callback overhead

Try:

- Larger buffers in the C library
- Memory-mapped files (see the evaluation doc)
- Pre-allocated column vectors (sketched below)
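On the last point, a minimal, hypothetical sketch of what pre-allocation looks like when the final row count is known from the metadata before rows stream in:

```rust
// Hypothetical column builder: `row_count` is known up front,
// so the vector can be sized once instead of growing repeatedly.
fn build_column(row_count: usize, values: impl Iterator<Item = f64>) -> Vec<f64> {
    let mut col = Vec::with_capacity(row_count); // single up-front allocation
    col.extend(values); // no reallocations as long as `values` fits
    col
}

fn main() {
    let col = build_column(1081, (0..1081).map(|i| i as f64));
    assert_eq!(col.len(), 1081);
}
```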
### Scenario 2: Slow Writing

Symptoms: `write_formats` time is high

Investigate:

- BufWriter buffer size
- Format-specific overhead
- Compression CPU usage

Try:

- Increase BufWriter capacity (currently 8KB; sketched below)
- Use faster compression (Snappy vs Zstd)
- Parallel writing (already implemented)
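For the first suggestion, `std::io::BufWriter` defaults to an 8 KiB buffer, and a larger capacity can reduce the number of write syscalls for big sequential outputs. A minimal sketch (the 1 MiB figure is an illustrative guess, not a tuned value):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let file = File::create("out.csv")?;
    // 1 MiB buffer instead of the 8 KiB default.
    let mut writer = BufWriter::with_capacity(1 << 20, file);
    writer.write_all(b"col_a,col_b\n")?;
    writer.flush()?; // surface any buffered I/O errors before dropping
    Ok(())
}
```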
### Scenario 3: Memory Issues

Symptoms: system swapping, OOM errors

Investigate:

- Chunk size too large
- Too many parallel streams
- Memory leaks

Try:

- Reduce `stream_rows` (default 10,000)
- Reduce the parallel write buffer (default 100MB; sketched below)
- Use bounded channels (already implemented)
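The buffer mentioned in the second suggestion is a `SpooledTempFile` from the `tempfile` crate: writes stay in memory up to a threshold and spill to disk beyond it, so lowering the threshold trades some speed for a smaller memory footprint. A minimal sketch with an illustrative 10 MB cap:

```rust
use std::io::Write;
use tempfile::SpooledTempFile;

fn main() -> std::io::Result<()> {
    // Kept in memory until 10 MB has been written, then transparently
    // spilled to a temporary file on disk.
    let mut buf = SpooledTempFile::new(10 * 1024 * 1024);
    buf.write_all(b"serialized row group bytes...")?;
    Ok(())
}
```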
### Scenario 4: High Variance

Symptoms: large confidence intervals, many outliers

Investigate:

- System background activity
- CPU frequency scaling
- Thermal throttling

Try:

- Close background apps
- Disable frequency scaling
- Run on a consistent power mode
## Advanced Profiling

### CPU Profiling with Flamegraphs

```sh
# Install flamegraph
cargo install flamegraph

# Profile a specific benchmark
cargo flamegraph --bench readstat_benchmarks -- --bench read_single_chunk

# Open flamegraph.svg to see hotspots
```

What to look for:

- Wide bars = lots of time spent
- Deep stacks = call overhead
- Unexpected functions = bugs/inefficiency

### Memory Profiling

```sh
# Using valgrind (Linux)
valgrind --tool=massif \
    cargo bench read_single_chunk --no-run
ms_print massif.out.* > memory_profile.txt

# Using heaptrack (Linux)
heaptrack cargo bench read_single_chunk
heaptrack_gui heaptrack.*.gz
```

### System Call Tracing

```sh
# Linux: strace
strace -c cargo bench read_single_chunk 2>&1 | tail -20

# macOS: dtruss
sudo dtruss -c cargo bench read_single_chunk
```
## Comparing Implementations

### Before/After Memory-Mapped Files

```sh
# Baseline without mmap
git checkout main
cargo bench -- --save-baseline without-mmap

# With the mmap implementation
git checkout feature/mmap
cargo bench -- --baseline without-mmap

# Look for improvements in read_single_chunk
```

### Parallel vs Sequential

```sh
# Test with different parallelism settings
cargo bench end_to_end -- --parallel
cargo bench end_to_end -- --sequential
```
## CI/CD Integration

### Performance Regression Detection

Add to `.github/workflows/benchmarks.yml`:

```yaml
name: Performance Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Run benchmarks
        run: |
          cd crates/readstat
          cargo bench --no-run # Just compile for CI

      - name: Compare with baseline (on main branch)
        if: github.event_name == 'pull_request'
        run: |
          git fetch origin main:main
          git checkout main
          cargo bench -- --save-baseline main
          git checkout -
          cargo bench -- --baseline main
```
## Best Practices

### Do's ✅

- Run benchmarks on consistent hardware
- Close background applications
- Use `--save-baseline` for comparisons
- Profile after benchmarking to find bottlenecks
- Document performance changes in PRs
- Test on representative data sizes

### Don'ts ❌

- Don't benchmark on a laptop (throttling)
- Don't optimize without profiling first
- Don't trust results with high variance
- Don't compare results across different systems
- Don't commit benchmark artifacts
- Don't skip statistical significance checks
## Performance Goals

### Current Performance (Baseline)

- Metadata reading: ~300-950 µs
- Read throughput: ~150-200K rows/sec
- Write throughput: ~70-100K rows/sec
- End-to-end: ~50-70K rows/sec

### Target Performance (Goals)

- Metadata reading: <500 µs (↓30%)
- Read throughput: >250K rows/sec (↑25%)
- Write throughput: >100K rows/sec (↑30%)
- End-to-end: >100K rows/sec (↑40%)

### Stretch Goals

- Memory-mapped reads: 2x faster for large files
- Parallel writes: 3-4x speedup with 4+ cores
- Compression: <10% overhead for Snappy
## Data Files for Benchmarking

### Current Test Data

- all_types.sas7bdat - 3 rows, 10 vars (tiny)
- cars.sas7bdat - 1081 rows, 13 vars (small)

### Recommended Additional Data

For comprehensive benchmarking, consider adding:

Small (good for quick iteration):

- <1 MB file size
- <1,000 rows
- 5-10 variables

Medium (typical use case):

- 10-100 MB file size
- 10,000-100,000 rows
- 10-50 variables
Large (stress test):

- >1 GB file size
- >1,000,000 rows
- 50+ variables
## Resources
### Tools
- cargo-flamegraph
- cargo-benchcmp
- hyperfine - CLI benchmarking (see below)
## Next Steps

- Run the full benchmark suite: `cargo bench`
- Review the HTML reports: open `target/criterion/report/index.html`
- Identify bottlenecks: look for the slowest operations
- Profile with flamegraph: focus on hotspots
- Implement optimizations: test one at a time
- Validate improvements: compare against the baseline
- Document findings: update this file with results
## Questions?

- See the detailed README: `crates/readstat/benches/README.md`
- Check the Criterion docs: https://bheisler.github.io/criterion.rs/book/
- Review the performance evaluation: memory-mapped files analysis (separate doc)
## Benchmarking with hyperfine

Benchmarking is performed with hyperfine.

This example compares the performance of the Rust binary with that of the C binary built from the ReadStat repository. In general, the hope is that the Rust binary's performance is fairly close to that of the C binary.

To run, execute the following from within the readstat directory.

```sh
# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"
```

📝 First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the --prepare option provided by hyperfine on Windows.

```sh
# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"
```

Other, future benchmarking may be performed now that channels and threads have been developed.
## Profiling with Flamegraphs

Profiling is performed with cargo flamegraph.

To run, execute the following from within the readstat directory.

```sh
cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv
```

The flamegraph is written to readstat/flamegraph.svg.

📝 Flamegraphs have yet to be utilized to improve performance.