Performance Benchmarking with Criterion
Overview
This document provides a comprehensive guide to performance benchmarking in readstat-rs using Criterion.rs.
Quick Start
# Run all benchmarks (from the repository root)
cargo bench -p readstat
# View HTML reports (Criterion writes to the workspace-root target/)
open target/criterion/report/index.html
What Gets Benchmarked
1. Reading Performance
- Metadata Reading (~300-950 µs) - File header parsing
- Single Chunk Reading - Full dataset read performance
- Chunked Reading - Streaming with different chunk sizes (1K, 5K, 10K rows)
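To make the chunk-size comparison concrete, here is a minimal sketch of how a dataset's rows could be split into chunks of at most N rows. The helper `chunk_ranges` is hypothetical and only illustrates the arithmetic; it is not the readstat-rs implementation.

```rust
/// Split `total_rows` into half-open `(start, end)` row ranges holding at most
/// `chunk_size` rows each. Hypothetical helper for illustration only.
fn chunk_ranges(total_rows: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (0..total_rows)
        .step_by(chunk_size)
        // The last chunk is clamped so it never runs past the final row.
        .map(|start| (start, (start + chunk_size).min(total_rows)))
        .collect()
}
```

With the 1081-row cars.sas7bdat file, a 1K chunk size would give two chunks under this scheme, while 5K and 10K would give a single chunk, which suggests larger test files are needed before chunk sizes can be told apart.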
2. Data Conversion
- Arrow Conversion - SAS types → Arrow RecordBatch overhead
3. Writing Performance
- CSV Writing - Text format output
- Parquet Compression - Uncompressed, Snappy, Zstd comparison
- Format Comparison - CSV vs Parquet vs Feather vs NDJSON
4. Parallel Write Optimization
- Buffer Sizes - SpooledTempFile memory thresholds (1MB, 10MB, 100MB, 500MB)
5. End-to-End Pipeline
- Complete Conversion - Read + Write combined (most important)
Sample Results
From initial benchmark run (example output):
metadata_reading/all_types.sas7bdat
time: [299.41 µs 301.84 µs 304.29 µs]
metadata_reading/cars.sas7bdat
time: [935.21 µs 943.52 µs 952.41 µs]
read_single_chunk/cars.sas7bdat
time: [~2-3 ms]
thrpt: [~150-200K rows/sec]
write_parquet_compression/snappy
time: [~4-6 ms]
thrpt: [~70-100K rows/sec]
end_to_end_conversion/parquet
time: [~6-9 ms]
thrpt: [~50-70K rows/sec]
Interpreting Results
Understanding the Output
Time Measurement:
time: [299.41 µs 301.84 µs 304.29 µs]
^ ^ ^
| | +-- Upper bound of the 95% confidence interval
| +------------ Best estimate (Criterion's point estimate, not necessarily the median)
+---------------------- Lower bound of the 95% confidence interval
Throughput:
thrpt: [150K elem/s 175K elem/s 200K elem/s]
^ ^ ^
| | +-- Upper bound
| +-------------- Best estimate
+-------------------------- Lower bound
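The throughput interval is derived from the time interval, which can be sketched as follows. The function name and the sample numbers are illustrative assumptions, not output from this project's benchmarks.

```rust
/// Convert a time interval (seconds per iteration, ordered lower/center/upper)
/// into a throughput interval (elements per second, ordered lower/center/upper).
/// The bounds swap places: the slowest time yields the lowest throughput.
fn throughput_interval(elements: u64, time_secs: [f64; 3]) -> [f64; 3] {
    let [fastest, center, slowest] = time_secs;
    let n = elements as f64;
    [n / slowest, n / center, n / fastest]
}
```

For example, 1,000 elements processed in [4 ms, 5 ms, 8 ms] correspond to roughly [125K, 200K, 250K] elements per second.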
Change Detection:
change: [-2.3456% -1.2345% +0.1234%] (p = 0.12 > 0.05)
^ ^ ^ ^
| | | +-- p-value (not significant here, since p > 0.05)
| | +----------- Upper bound of change
| +--------------------- Best estimate of change
+------------------------------- Lower bound of change
What to Look For
🔴 Red Flags (Investigate)
- High variance (>10%) - Results unreliable
- Significant regression (>5% slower, p < 0.05)
- Outliers (>5% of samples)
🟡 Opportunities
- Chunked reading - Test if different chunk size improves throughput
- Buffer sizes - If small buffer performs as well as large, save memory
- Compression - If uncompressed only slightly faster, use compression
🟢 Validation
- Low variance (<5%) - Reliable results
- Improvements (>10% faster, p < 0.05)
- Expected patterns (e.g., compression should be slower but smaller)
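One way to put a number on "variance" is the width of the reported interval relative to its center value. This is an assumed interpretation of the thresholds above, not a metric Criterion prints itself, and the function names are hypothetical.

```rust
/// Width of the reported interval as a fraction of its center value.
fn relative_ci_width(lower: f64, center: f64, upper: f64) -> f64 {
    (upper - lower) / center
}

/// Flag a result as unreliable when the interval is wider than 10% of the
/// center value (threshold taken from the red-flag list above).
fn is_noisy(lower: f64, center: f64, upper: f64) -> bool {
    relative_ci_width(lower, center, upper) > 0.10
}
```

Applied to the sample line `[299.41 µs 301.84 µs 304.29 µs]`, the relative width is about 1.6%, which falls in the low-variance range.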
Performance Optimization Workflow
Step 1: Establish Baseline
# Save current performance as baseline (Criterion options go after `--`)
cargo bench -- --save-baseline main
# Results saved to target/criterion/{benchmark}/main/
Step 2: Make Changes
Edit code with optimization hypothesis:
- Increase buffer size
- Change algorithm
- Add caching
- Parallel processing
Step 3: Measure Impact
# Compare against baseline
cargo bench -- --baseline main
# Look for "change: [X% Y% Z%]" in output
Step 4: Analyze & Iterate
If improved (>10%, p < 0.05):
✅ Keep the change
✅ Update baseline: cargo bench -- --save-baseline main
If no change (<5%):
⚠️ Optimization didn’t help - profile to find the real bottleneck
If regressed (slower):
❌ Revert change
❌ Investigate why performance decreased
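The decision rules in Step 4 can be sketched as a small function. The `Verdict` type and the `Inconclusive` case (a significant speedup between 5% and 10%, which the rules above do not cover) are assumptions added for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Keep,         // significant improvement of 10% or more
    NoChange,     // below 5% or not statistically significant
    Revert,       // significant slowdown
    Inconclusive, // assumption: significant speedup between 5% and 10%
}

/// `change_pct` is the center value of Criterion's change line, where a
/// negative number means the benchmark got faster. `p_value` is the reported p.
fn classify(change_pct: f64, p_value: f64) -> Verdict {
    if p_value >= 0.05 || change_pct.abs() < 5.0 {
        Verdict::NoChange
    } else if change_pct <= -10.0 {
        Verdict::Keep
    } else if change_pct > 0.0 {
        Verdict::Revert
    } else {
        Verdict::Inconclusive
    }
}
```

The sample change line shown earlier (-1.2345%, p = 0.12) lands in `NoChange` under these rules.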
Common Optimization Scenarios
Scenario 1: Slow Reading
Symptoms: read_single_chunk time is high
Investigate:
- ReadStat C library overhead (FFI calls)
- Memory allocation patterns
- Callback overhead
Try:
- Larger buffers in C library
- Memory-mapped files (see evaluation doc)
- Pre-allocate column vectors
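As a rough illustration of the pre-allocation idea, the sketch below reserves space for a whole column before filling it, so pushing rows never triggers a reallocation. The function is hypothetical and uses synthetic values; it does not reflect how readstat-rs builds its columns.

```rust
/// Fill a numeric column of `row_count` rows after reserving its capacity up
/// front. Returns the column together with the capacity it started with.
fn fill_preallocated(row_count: usize) -> (Vec<f64>, usize) {
    let mut column: Vec<f64> = Vec::with_capacity(row_count);
    let initial_capacity = column.capacity();
    for row in 0..row_count {
        // Synthetic value standing in for a parsed cell.
        column.push(row as f64);
    }
    (column, initial_capacity)
}
```

If the capacity after filling equals the initial capacity, no reallocation happened during the fill.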
Scenario 2: Slow Writing
Symptoms: write_formats time is high
Investigate:
- BufWriter buffer size
- Format-specific overhead
- Compression CPU usage
Try:
- Increase BufWriter capacity (currently 8KB)
- Use faster compression (Snappy vs Zstd)
- Parallel writing (already implemented)
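A larger write buffer can be tried with `BufWriter::with_capacity` from the standard library. The helper below is a hypothetical sketch that writes to any `Write` sink; the 64 KiB figure in the usage note is an arbitrary value to experiment with, not a recommendation from this project.

```rust
use std::io::{self, BufWriter, Write};

/// Write `rows` as newline-terminated lines through a BufWriter with an
/// explicit buffer capacity, then return the underlying sink.
fn write_rows<W: Write>(sink: W, capacity: usize, rows: &[&str]) -> io::Result<W> {
    let mut writer = BufWriter::with_capacity(capacity, sink);
    for row in rows {
        writeln!(writer, "{row}")?;
    }
    // into_inner flushes any buffered bytes before handing the sink back.
    writer.into_inner().map_err(|e| e.into_error())
}
```

Calling `write_rows(file, 64 * 1024, &rows)` in place of a default `BufWriter::new(file)` and rerunning the write benchmarks against a saved baseline would show whether the buffer size matters for a given format.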
Scenario 3: Memory Issues
Symptoms: System swapping, OOM errors
Investigate:
- Chunk size too large
- Too many parallel streams
- Memory leaks
Try:
- Reduce stream_rows (default 10,000)
- Reduce parallel write buffer (default 100MB)
- Use bounded channels (already implemented)
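To show why bounded channels cap memory use, here is a minimal sketch built on the standard library's `sync_channel`. It is an assumption that this resembles the project's setup; readstat-rs may use a different channel implementation, and `stream_chunks` is a hypothetical name.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send `chunks` chunk ids through a channel that holds at most `bound`
/// pending items, and return how many the consumer received.
fn stream_chunks(chunks: usize, bound: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(bound);
    let producer = thread::spawn(move || {
        for chunk_id in 0..chunks {
            // send blocks while the channel is full, so the reader cannot
            // run arbitrarily far ahead of the writer.
            tx.send(chunk_id).expect("receiver dropped");
        }
        // tx is dropped here, which ends the consumer's iteration.
    });
    let received = rx.iter().count();
    producer.join().expect("producer thread panicked");
    received
}
```

With a bound of 4, at most four chunks wait in the channel at any time, regardless of how many chunks the file contains.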
Scenario 4: High Variance
Symptoms: Large confidence intervals, many outliers
Investigate:
- System background activity
- CPU frequency scaling
- Thermal throttling
Try:
- Close background apps
- Disable frequency scaling
- Run on consistent power mode
Advanced Profiling
CPU Profiling with Flamegraphs
# Install flamegraph
cargo install flamegraph
# Profile a specific benchmark
cargo flamegraph --bench readstat_benchmarks -- --bench read_single_chunk
# Open flamegraph.svg to see hotspots
What to look for:
- Wide bars = lots of time spent
- Deep stacks = call overhead
- Unexpected functions = bugs/inefficiency
Memory Profiling
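# Using valgrind (Linux): build the benchmark binary first, then profile it directly.
# `cargo bench --no-run` prints the executable path; <hash> below is a placeholder.
cargo bench --no-run
valgrind --tool=massif \
target/release/deps/readstat_benchmarks-<hash> --bench read_single_chunk
ms_print massif.out.* > memory_profile.txt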
# Using heaptrack (Linux)
heaptrack cargo bench read_single_chunk
heaptrack_gui heaptrack.*.gz
System Call Tracing
# Linux: strace (-f follows the benchmark process that cargo spawns)
strace -f -c cargo bench read_single_chunk 2>&1 | tail -20
# macOS: dtruss (-f follows child processes)
sudo dtruss -c -f cargo bench read_single_chunk
Comparing Implementations
Before/After Memory-Mapped Files
# Baseline without mmap
git checkout main
cargo bench -- --save-baseline without-mmap
# With mmap implementation
git checkout feature/mmap
cargo bench -- --baseline without-mmap
# Look for improvements in read_single_chunk
Parallel vs Sequential
# Test with different parallelism settings
cargo bench end_to_end -- --parallel
cargo bench end_to_end -- --sequential
CI/CD Integration
Performance Regression Detection
Add to .github/workflows/benchmarks.yml:
name: Performance Benchmarks
on:
  pull_request:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Run benchmarks
        run: |
          cd crates/readstat
          cargo bench --no-run # Just compile for CI
      - name: Compare with baseline (on main branch)
        if: github.event_name == 'pull_request'
        run: |
          git fetch origin main:main
          git checkout main
          cargo bench -- --save-baseline main
          git checkout -
          cargo bench -- --baseline main
Best Practices
Do’s ✅
- Run benchmarks on consistent hardware
- Close background applications
- Use --save-baseline for comparisons
- Profile after benchmarking to find bottlenecks
- Document performance changes in PRs
- Test on representative data sizes
Don’ts ❌
- Don’t benchmark on a laptop (thermal throttling)
- Don’t optimize without profiling first
- Don’t trust results with high variance
- Don’t compare across different systems
- Don’t commit benchmark artifacts
- Don’t skip statistical significance checks
Performance Goals
Current Performance (Baseline)
- Metadata reading: ~300-950 µs
- Read throughput: ~150-200K rows/sec
- Write throughput: ~70-100K rows/sec
- End-to-end: ~50-70K rows/sec
Target Performance (Goals)
- Metadata reading: <500 µs (↓30%)
- Read throughput: >250K rows/sec (↑25%)
- Write throughput: >100K rows/sec (↑30%)
- End-to-end: >100K rows/sec (↑40%)
Stretch Goals
- Memory-mapped reads: 2x faster for large files
- Parallel writes: 3-4x speedup with 4+ cores
- Compression: <10% overhead for Snappy
Data Files for Benchmarking
Current Test Data
- all_types.sas7bdat - 3 rows, 10 vars (tiny)
- cars.sas7bdat - 1081 rows, 13 vars (small)
Recommended Additional Data
For comprehensive benchmarking, consider adding:
Small (good for quick iteration):
- < 1 MB file size
- < 1,000 rows
- 5-10 variables
Medium (typical use case):
- 10-100 MB file size
- 10,000-100,000 rows
- 10-50 variables
Large (stress test):
- > 1 GB file size
- > 1,000,000 rows
- 50+ variables
Resources
Documentation
Tools
- cargo-flamegraph
- cargo-benchcmp
- hyperfine - CLI benchmarking (see below)
Blog Posts
Next Steps
- Run full benchmark suite: cargo bench
- Review HTML reports: Open target/criterion/report/index.html
- Identify bottlenecks: Look for slowest operations
- Profile with flamegraph: Focus on hotspots
- Implement optimizations: Test one at a time
- Validate improvements: Compare against baseline
- Document findings: Update this file with results
Questions?
- See detailed README: crates/readstat/benches/README.md
- Check Criterion docs: https://bheisler.github.io/criterion.rs/book/
- Review performance evaluation: Memory-mapped files analysis (separate doc)
Benchmarking with hyperfine
Benchmarking performed with hyperfine.
This example compares the performance of the Rust binary with that of the C binary built from the ReadStat repository. The goal is for the Rust binary to perform fairly close to the C binary.
To run, execute the following from within the readstat directory.
# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat crates\readstat-tests\tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"
📝 First experiments on Windows are difficult to interpret because of file caching. Further research is needed into using the --prepare option provided by hyperfine on Windows.
# Linux and macOS (the drop_caches --prepare step is Linux-only; macOS has no /proc)
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data crates/readstat-tests/tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"
Further benchmarking may be performed now that channels and threads have been implemented.
Profiling with Flamegraphs
Profiling performed with cargo flamegraph.
The readstat binary lives in the readstat-cli crate, so target it with -p readstat-cli. Run the following from the repository root.
cargo flamegraph -p readstat-cli --bin readstat -- data crates/readstat-tests/tests/data/_ahs2019n.sas7bdat --output crates/readstat-tests/tests/data/_ahs2019n.csv
Flamegraph is written to flamegraph.svg in the directory you run the command from (the repository root).
📝 Flamegraphs have not yet been used to improve performance.