readstat-rs
Read, inspect, and convert SAS binary (.sas7bdat) files from Rust code, the command line, or the browser. Converts to CSV, Parquet, Feather, and NDJSON using Apache Arrow.
The original use case was a command-line tool for converting SAS files, but the project has since expanded into a workspace of crates that can be used as a Rust library, a CLI, or compiled to WebAssembly for browser and JavaScript runtimes.
Dependencies
The command-line tool is developed in Rust and is only possible due to the following excellent projects:
- The ReadStat C library developed by Evan Miller
- The arrow Rust crate developed by the Apache Arrow community
The ReadStat library is used to parse and read sas7bdat files, and the arrow crate is used to convert the read sas7bdat data into the Arrow memory format. Once in the Arrow memory format, the data can be written to other file formats.
💡 Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The `readstat-sys` crate exposes the full ReadStat API (all 125 functions across all formats). However, the higher-level crates (`readstat`, `readstat-cli`, `readstat-wasm`, `readstat-tests`) currently only implement support for SAS `.sas7bdat` files.
CLI Quickstart
Convert the first 50,000 rows of example.sas7bdat (by performing the read in parallel) to the file example.parquet, overwriting the file if it already exists.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 50000 --overwrite --parallel
CLI Install
Download a Release
[Mostly] static binaries for Linux, macOS, and Windows may be found at the Releases page.
Setup
Move the readstat binary to a known directory and add the binary to the user's PATH.
Linux & macOS
Ensure the path to readstat is added to the appropriate shell configuration file.
Windows
For Windows users, path configuration may be found within the Environment Variables menu. Executing the following from the command line opens the Environment Variables menu for the current user.
rundll32.exe sysdm.cpl,EditEnvironmentVariables
Alternatively, update the user-level PATH in PowerShell (replace C:\path\to\readstat with the actual directory):
$currentPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$currentPath;C:\path\to\readstat", "User")
After running the above, restart your terminal for the change to take effect.
Run
Run the binary.
readstat --help
CLI Usage
The binary is invoked using subcommands:
- `metadata` – writes file and variable metadata to standard out or JSON
- `preview` – writes the first N rows of parsed data as `csv` to standard out
- `data` – writes parsed data in `csv`, `feather`, `ndjson`, or `parquet` format to a file
Column metadata (labels, SAS format strings, and storage widths) is preserved in Parquet and Feather output as Arrow field metadata. See docs/TECHNICAL.md for details.
For the full CLI reference (including column selection, parallelism, memory considerations, SQL queries, reader modes, and debug options) see docs/USAGE.md.
For library, API server, and WebAssembly usage, see Examples below.
Build from Source
Clone the repository (with submodules), install platform-specific developer tools, and run cargo build. Platform-specific instructions for Linux, macOS, and Windows are in docs/BUILDING.md.
Platform Support
| Platform | Status | C library | Notes |
|---|---|---|---|
| Linux (glibc) | ✅ Builds and runs | System iconv, system zlib | |
| Linux (musl) | ✅ Builds and runs | System iconv, system zlib | |
| macOS | ✅ Builds and runs | System libiconv, system zlib | |
| Windows (MSVC) | ✅ Builds and runs | Vendored iconv, vendored zlib | Requires libclang for bindgen. MSVC supported since ReadStat 1.1.5 (no msys2 needed). |
Documentation
| Document | Description |
|---|---|
| docs/ARCHITECTURE.md | Crate layout, key types, and architectural patterns |
| docs/USAGE.md | Full CLI reference and examples |
| docs/BUILDING.md | Clone, build, and linking details per platform |
| docs/TECHNICAL.md | Floating-point precision and date/time handling |
| docs/TESTING.md | Running tests, dataset table, valgrind |
| docs/BENCHMARKING.md | Criterion benchmarks, hyperfine, and profiling |
| docs/CI-CD.md | GitHub Actions triggers and artifacts |
| docs/MEMORY_SAFETY.md | Automated memory-safety CI checks (Valgrind, ASan, Miri, unsafe audit) |
| docs/RELEASING.md | Step-by-step guide for publishing crates to crates.io |
Workspace Crates
| Crate | Path | Description |
|---|---|---|
| `readstat` | `crates/readstat/` | Pure library for parsing SAS files into Arrow RecordBatch format. Output writers are feature-gated. |
| `readstat-cli` | `crates/readstat-cli/` | Binary crate producing the readstat CLI tool (arg parsing, progress bars, orchestration). |
| `readstat-sys` | `crates/readstat-sys/` | Raw FFI bindings to the full ReadStat C library (SAS, SPSS, Stata) via bindgen. |
| `readstat-iconv-sys` | `crates/readstat-iconv-sys/` | Windows-only FFI bindings to libiconv for character encoding conversion. |
| `readstat-tests` | `crates/readstat-tests/` | Integration test suite (29 modules, 14 datasets). |
| `readstat-wasm` | `crates/readstat-wasm/` | WebAssembly build for browser/JS usage (excluded from workspace, built with Emscripten). |
For full architectural details, see docs/ARCHITECTURE.md.
Examples
The examples/ directory contains runnable demos showing different ways to use readstat-rs.
| Example | Description |
|---|---|
| `cli-demo` | Convert a .sas7bdat file to CSV, NDJSON, Parquet, and Feather using the readstat CLI |
| `api-demo` | API servers in Rust (Axum) and Python (FastAPI + PyO3): upload, inspect, and convert SAS files over HTTP |
| `bun-demo` | Parse a .sas7bdat file from JavaScript using the WebAssembly build with Bun |
| `web-demo` | Browser-based viewer and converter: upload, preview, and export entirely client-side via WASM |
| `sql-explorer` | Browser-based SQL explorer: upload a .sas7bdat file and query it interactively with SQL via AlaSQL |
To use readstat as a library in your own Rust project, add the readstat crate as a dependency.
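As a minimal sketch (assuming the crate version documented in this repository and the default writer features listed for the `readstat` crate), the dependency entry might look like:

```toml
# Hypothetical Cargo.toml fragment. The csv/feather/ndjson/parquet writer
# features are enabled by default; disable default features to compile only
# the writers you need.
[dependencies]
readstat = { version = "0.20", default-features = false, features = ["parquet"] }
```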
Resources
The following have been incredibly helpful while developing!
- How to not RiiR
- Making a *-sys crate
- Rust Closures in FFI
- Rust FFI: Microsoft Flight Simulator SDK
- Stack Overflow answers by Jake Goulding
- ReadStat pull request to add MSVC/Windows support
- jamovi-readstat appveyor.yml file to build ReadStat on Windows
- Arrow documentation for utilizing ArrayBuilders
Building from Source
Clone
Ensure submodules are also cloned.
git clone --recurse-submodules https://github.com/curtisalexander/readstat-rs.git
The ReadStat repository is included as a git submodule within this repository. To build and link, the readstat-sys crate is built first; the readstat library and readstat-cli binary crates then depend on readstat-sys.
Linux
Install developer tools
sudo apt install build-essential clang
Build
cargo build
iconv: Linked dynamically against the system-provided library. On most distributions it is available by default. No explicit link directives are emitted in the build script; the system linker resolves it automatically.
zlib: Linked via the libz-sys crate, which will use the system-provided zlib if available or compile from source as a fallback.
macOS
Install developer tools
xcode-select --install
Build
cargo build
iconv: Linked dynamically against the system-provided library that ships with macOS (via cargo:rustc-link-lib=iconv in the readstat-sys build script). No additional packages need to be installed.
zlib: Linked via the libz-sys crate, which will use the system-provided zlib that ships with macOS.
Windows
Building on Windows requires that LLVM and the Visual Studio C++ Build Tools be downloaded and installed.
In addition, the path to libclang needs to be set in the environment variable LIBCLANG_PATH. If LIBCLANG_PATH is not set, the readstat-sys build script will check the default path C:\Program Files\LLVM\lib and fail with instructions if it does not exist.
For details see the following.
Build
cargo build
iconv: Compiled from source using the vendored libiconv-win-build submodule (located at crates/readstat-iconv-sys/vendor/libiconv-win-build/) via the readstat-iconv-sys crate. readstat-iconv-sys is a Windows-only dependency (gated behind [target.'cfg(windows)'.dependencies] in readstat-sys/Cargo.toml).
zlib: Compiled from source via the libz-sys crate (statically linked).
Linking Summary
| Platform | iconv | zlib |
|---|---|---|
| Linux (glibc/musl) | Dynamic (system) | libz-sys (prefers system, falls back to source) |
| macOS (x86/ARM) | Dynamic (system) | libz-sys (uses system) |
| Windows (MSVC) | Static (vendored submodule) | libz-sys (compiled from source, static) |
Usage
After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:
- `metadata` – writes the following to standard out or JSON:
- row count
- variable count
- table name
- table label
- file encoding
- format version
- bitness
- creation time
- modified time
- compression
- byte order
- variable names
- variable type classes
- variable types
- variable labels
- variable format classes
- variable formats
- arrow data types
- `preview` – writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in `csv` format to standard out
- `data` – writes parsed data in `csv`, `feather`, `ndjson`, or `parquet` format to a file
Metadata
To write metadata to standard out, invoke the following.
readstat metadata /some/dir/to/example.sas7bdat
To write metadata as JSON, invoke the following. This is useful for reading the metadata programmatically.
readstat metadata /some/dir/to/example.sas7bdat --as-json
The JSON output contains file-level metadata and a vars object keyed by variable index. This makes it straightforward to search for a particular column by piping the output to jq or Python.
Search for a column with jq
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | select(.value.var_name == "Make") | .value'
Search for a column with Python
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
match = [v for v in md['vars'].values() if v['var_name'] == 'Make']
if match:
print(json.dumps(match[0], indent=2))
"
Preview Data
To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).
readstat preview /some/dir/to/example.sas7bdat
To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.
readstat preview /some/dir/to/example.sas7bdat --rows 100
Data
The data subcommand includes a parameter for --format, which is the file format that is to be written. Currently, the following formats have been implemented:
- `csv`
- `feather`
- `ndjson`
- `parquet`
csv
To write parsed data (as csv) to a file, invoke the following (default is to write all parsed data to the specified file).
The default --format is csv; thus, the parameter is omitted from the examples below.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv
To write the first 100 rows of parsed data (as csv) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100
feather
To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather
To write the first 100 rows of parsed data (as feather) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100
ndjson
To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson
To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100
parquet
To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet
To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100
To write parsed data (as parquet) to a file with specific compression settings, invoke the following:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3
Column Selection
Select specific columns to include when converting or previewing data.
Step 1: View available columns
readstat metadata /some/dir/to/example.sas7bdat
Or as JSON for programmatic use with jq:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | .value.var_name'
Or with Python:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
for v in md['vars'].values():
print(v['var_name'])
"
Step 2: Select columns on the command line
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns Brand,Model,EngineSize
Step 2 (alt): Select columns from a file
Create columns.txt:
# Columns to extract from the dataset
Brand
Model
EngineSize
Then pass it to the CLI:
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns-file columns.txt
Preview with column selection
readstat preview /some/dir/to/example.sas7bdat --columns Brand,Model,EngineSize
Parallelism
The data subcommand includes parameters for both parallel reading and parallel writing:
Parallel Reading (--parallel)
If set, the sas7bdat is read in parallel: when the total rows to process exceed stream-rows (default 10,000 if unset), each chunk of rows is read in parallel. Note that all processors on the user's machine are used with the --parallel option. Allowing the user to throttle this number may be considered in the future.
⚠️ Utilizing the --parallel parameter will increase memory usage: all chunks are read in parallel and collected in memory before being sent to the writer. In addition, because all processors are utilized, CPU usage may be maxed out during reading. Row ordering from the original sas7bdat is preserved.
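As a rough, dependency-free sketch of this pattern (using `std::thread` for illustration; the actual implementation uses Rayon), chunks can be processed concurrently, tagged with their index, and reassembled in the original row order:

```rust
use std::thread;

// Sketch of order-preserving parallel chunk processing: each chunk is
// handled on its own thread, results carry their chunk index, and the
// final output is reassembled by sorting on that index.
fn process_chunks_in_parallel(chunks: Vec<Vec<u64>>) -> Vec<u64> {
    let handles: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(idx, chunk)| {
            thread::spawn(move || {
                // Simulate per-chunk parsing work (here: doubling values).
                let parsed: Vec<u64> = chunk.iter().map(|v| v * 2).collect();
                (idx, parsed)
            })
        })
        .collect();

    // Collect results and restore the original chunk order by index.
    let mut results: Vec<(usize, Vec<u64>)> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().flat_map(|(_, parsed)| parsed).collect()
}
```

Because every chunk result is held until the merge, this style trades memory for throughput, which is exactly the caveat noted above.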
Parallel Writing (--parallel-write)
When combined with --parallel, the --parallel-write flag enables parallel writing for Parquet format files. This can significantly improve write performance for large datasets by:
- Writing record batches to temporary files in parallel using all available processors
- Merging the temporary files into the final output
- Using spooled temporary files that keep data in memory until a threshold is reached
Note: Parallel writing currently only supports the Parquet format. Other formats (CSV, Feather, NDJSON) will use optimized sequential writes with BufWriter.
Example usage:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write
Memory Buffer Size (--parallel-write-buffer-mb)
Controls the memory buffer size (in MB) before spilling to disk during parallel writes. Defaults to 100 MB. Valid range: 1-10240 MB.
Smaller buffers will cause data to spill to disk sooner, while larger buffers keep more data in memory. Choose based on your available memory and dataset size:
- Small datasets (< 100 MB): Use default or larger buffer to keep everything in memory
- Large datasets (> 1 GB): Consider smaller buffer (10-50 MB) to manage memory usage
- Memory-constrained systems: Use smaller buffer (1-10 MB)
Example with custom buffer size:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write --parallel-write-buffer-mb 200
⚠️ Parallel writing may write batches out of order. This is acceptable for Parquet files as the row order is preserved when merged.
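A hypothetical sketch of the spooling idea mentioned above (names and threshold handling are illustrative, not the crate's actual implementation): bytes accumulate in memory until a threshold is crossed, then spill to a temporary file on disk.

```rust
use std::fs::File;
use std::io::{self, Write};

// Illustrative spooled buffer: data stays in memory until `threshold`
// bytes would be exceeded, at which point it spills to a file.
enum Spool {
    Memory(Vec<u8>),
    Disk(File),
}

struct SpooledWriter {
    spool: Spool,
    threshold: usize,
    path: std::path::PathBuf,
}

impl SpooledWriter {
    fn new(threshold: usize, path: std::path::PathBuf) -> Self {
        SpooledWriter { spool: Spool::Memory(Vec::new()), threshold, path }
    }

    fn is_spilled(&self) -> bool {
        matches!(self.spool, Spool::Disk(_))
    }

    fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
        if let Spool::Memory(buf) = &mut self.spool {
            if buf.len() + data.len() <= self.threshold {
                buf.extend_from_slice(data);
                return Ok(());
            }
            // Threshold exceeded: flush the in-memory buffer to disk.
            let mut file = File::create(&self.path)?;
            file.write_all(buf)?;
            self.spool = Spool::Disk(file);
        }
        if let Spool::Disk(file) = &mut self.spool {
            file.write_all(data)?;
        }
        Ok(())
    }
}
```

With a 100 MB default threshold, small batch groups never touch the disk at all, while large ones degrade gracefully to temp-file I/O.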
Memory Considerations
Default: Sequential Writes
In the default sequential write mode, a bounded channel (capacity 10) connects the reader thread to the writer. This means at most 10 chunks (each containing up to stream-rows rows) are held in memory at any time, providing natural backpressure when the writer is slower than the reader. For most workloads this keeps memory usage reasonable, but for very wide datasets (hundreds of columns, string-heavy) each chunk can be large; consider lowering --stream-rows if memory is a concern.
Sequential Write (default)
==========================
Reader Thread Bounded Channel (cap 10) Main Thread
+---------------------+ +------------------------+ +---------------------+
| | | | | |
| +-----------+ | send | +--+--+--+--+--+--+ | recv | +-------+ |
| | chunk 1 |-------|------>| | | | | | | | |------>| | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+ | | +-------+ |
| +-----------+ | send | channel is full! | | |
| | chunk 2 |-------|------>| +--+--+--+--+--+--+--+| | +-------+ |
| +-----------+ | | | | | | | | | || | | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+--+| | +-------+ |
| | chunk 3 |-------|-XXXXX | | | |
| +-----------+ | BLOCK | writer drains a slot | | +-------+ |
| ... waits ... | | +--+--+--+--+--+--+ | | | write |---> file |
| | chunk 3 |-------|------>| | | | | | | | | | +-------+ |
| +-----------+ | ok! | +--+--+--+--+--+--+ | | |
| | | | | |
+---------------------+ +------------------------+ +---------------------+
Memory at any moment: <= 10 chunks in the channel + 1 being written
Backpressure: reader blocks when channel is full
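The same backpressure can be sketched with the standard library's bounded channel (`std::sync::mpsc::sync_channel`; the crate itself uses Crossbeam channels): the reader blocks on `send` whenever the channel already holds its capacity of chunks.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Minimal sketch of the bounded reader-writer pipeline diagrammed above.
fn run_pipeline(num_chunks: usize, capacity: usize) -> Vec<usize> {
    let (tx, rx) = sync_channel::<usize>(capacity);

    let reader = thread::spawn(move || {
        for chunk_id in 0..num_chunks {
            // `send` blocks when the channel already holds `capacity`
            // chunks, providing backpressure against a slow writer.
            tx.send(chunk_id).unwrap();
        }
        // Dropping `tx` here closes the channel and ends the writer loop.
    });

    // The "writer": drain chunks in arrival order.
    let written: Vec<usize> = rx.iter().collect();
    reader.join().unwrap();
    written
}
```

With a capacity of 10, at most 10 chunks ever sit in the channel, regardless of how many chunks the file produces.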
Parallel Writes (--parallel-write)
--parallel-write: Uses bounded-batch processing: batches are pulled from the channel in groups (up to 10 at a time), written in parallel to temporary Parquet files, then the next group is pulled. This preserves the channel's backpressure so that memory usage stays bounded rather than loading the entire dataset at once. All temporary files are merged into the final output at the end.
Parallel Write (--parallel --parallel-write)
============================================
Reader Thread Bounded Channel (cap 10) Main Thread
+------------------+ +------------------------+ +-------------------------+
| | | | | |
| +----------+ | send | | recv | Pull <= 10 batches |
| | chunk 1 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | | b1 | b2 | .. | bN | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 2 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| +----------+ | | Write in parallel |
| | chunk 3 |-----|----> ... | to temp .parquet files |
| +----------+ | | | | | |
| ... | | v v v |
| | | tmp_0 tmp_1 ... tmp_N |
| | +------------------------+ | |
| +----------+ | send | | recv | Pull next <= 10 |
| | chunk 11 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | |b11 |b12 | .. | bM | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 12 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| ... | | tmp_N+1 ... tmp_M |
+------------------+ | |
| ... repeat until done |
+-------------------------+
|
+----------------------------------------+
|
v
+-------------------+ +--------------------+
| Merge all temp | | |
| .parquet files |------>| final output.pqt |
| in order | | |
+-------------------+ +--------------------+
Memory at any moment: <= 10 chunks in channel + 10 being written
Backpressure: preserved -- reader blocks while a batch group is being written
SQL Queries (--sql)
⚠️ --sql (feature-gated): SQL queries require the full dataset to be materialized in memory via DataFusion's MemTable before query execution. For large files this may result in significant memory usage. Queries that filter rows (e.g. SELECT ... WHERE ...) will reduce the output size but the input must still be fully loaded.
SQL Query Mode (--sql "SELECT ...")
===================================
Reader Thread Bounded Channel Main Thread
+------------------+ +---------------+ +---------------------------+
| | | | | |
| +----------+ | send | | recv | Collect ALL batches |
| | chunk 1 |-----|------>| |------>| into memory (required |
| +----------+ | | | | by DataFusion MemTable) |
| +----------+ | send | | | |
| | chunk 2 |-----|------>| |------>| +-----+-----+-----+ |
| +----------+ | | | | | b1 | b2 | ... | |
| ... | | | | +-----+-----+-----+ |
| +----------+ | send | | | | |
| | chunk N |-----|------>| |------>| v |
| +----------+ | | | | +-------------+ |
+------------------+ +---------------+ | | DataFusion | |
| | SQL Engine | |
| +-------------+ |
| | |
| v |
| Write filtered results |
| to output file |
+---------------------------+
Memory at peak: ALL chunks in memory (no backpressure)
This is inherent to SQL execution over in-memory tables.
Reading Metadata from Output Files
When converting to Parquet or Feather, readstat-rs preserves column metadata (labels, SAS format strings, and storage widths) as Arrow field metadata. Schema-level metadata includes the table label when present.
The following metadata keys may appear on each field:
| Key | Description | Condition |
|---|---|---|
| `label` | User-assigned variable label | Non-empty |
| `sas_format` | SAS format string (e.g. DATE9, BEST12, $30) | Non-empty |
| `storage_width` | Number of bytes used to store the variable | Always |
| `display_width` | Display width hint from the file | Non-zero |
Schema-level metadata:
| Key | Description | Condition |
|---|---|---|
| `table_label` | User-assigned file label | Non-empty |
Reading metadata with Python (pyarrow)
import pyarrow.parquet as pq
schema = pq.read_schema("example.parquet")
# Table-level metadata
print(schema.metadata.get(b"table_label", b"").decode())
# Per-column metadata
for field in schema:
meta = field.metadata or {}
print(f"{field.name}:")
print(f" label: {meta.get(b'label', b'').decode()}")
print(f" sas_format: {meta.get(b'sas_format', b'').decode()}")
print(f" storage_width: {meta.get(b'storage_width', b'').decode()}")
print(f" display_width: {meta.get(b'display_width', b'').decode()}")
Reading metadata with R (arrow)
library(arrow)
schema <- read_parquet("example.parquet", as_data_frame = FALSE)$schema
# Per-column metadata
for (i in seq_len(schema$num_fields)) {
  field <- schema$field(i - 1)  # field() is 0-indexed
  cat(field$name, "\n")
  cat("  label:        ", field$metadata$label, "\n")
  cat("  sas_format:   ", field$metadata$sas_format, "\n")
  cat("  storage_width:", field$metadata$storage_width, "\n")
  cat("  display_width:", field$metadata$display_width, "\n")
}
Reader
The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.
- `mem` – Parse and read the entire `sas7bdat` into memory before writing to either standard out or a file
- `stream` (default) – Parse and read at most `stream-rows` into memory before writing to disk
  - `stream-rows` may be set via the command line parameter `--stream-rows`; if elided it defaults to 10,000 rows
Why is this useful?
- `mem` is useful for testing purposes
- `stream` is useful for keeping memory usage low for large datasets (and hence is the default)
- In general, users should not need to deviate from the default, `stream`, unless they have a specific need
- In addition, by enabling these options as command line parameters, hyperfine may be used to benchmark across an assortment of file sizes
Debug
Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.
⚠️ This is quite verbose! If using the preview or data subcommand, debug information is written for every single value!
# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...
Help
For full details run with --help.
readstat --help
readstat metadata --help
readstat preview --help
readstat data --help
Architecture
Rust CLI tool and library that reads SAS binary files (.sas7bdat) and converts them to other formats (CSV, Feather, NDJSON, Parquet). Uses FFI bindings to the ReadStat C library for parsing, and Apache Arrow for in-memory representation and output.
Scope: The readstat-sys crate exposes the full ReadStat C API, which supports SAS (.sas7bdat, .xpt), SPSS (.sav, .zsav, .por), and Stata (.dta). However, the readstat, readstat-cli, and readstat-wasm crates currently only implement parsing and conversion for SAS .sas7bdat files.
Workspace Layout
readstat-rs/
├── Cargo.toml              # Workspace root (edition 2024, resolver 2)
├── crates/
│   ├── readstat/           # Library crate (parse SAS → Arrow, optional format writers)
│   ├── readstat-cli/       # Binary crate (CLI arg parsing, orchestration)
│   ├── readstat-sys/       # FFI bindings to ReadStat C library (bindgen)
│   ├── readstat-iconv-sys/ # FFI bindings to iconv (Windows only)
│   ├── readstat-tests/     # Integration test suite
│   └── readstat-wasm/      # WebAssembly build (excluded from workspace)
├── examples/
│   ├── cli-demo/           # CLI conversion demo
│   ├── api-demo/           # REST API servers (Rust + Python)
│   ├── bun-demo/           # WASM usage from Bun/JS
│   ├── web-demo/           # Browser-based viewer and converter
│   └── sql-explorer/       # Browser-based SQL explorer (AlaSQL + WASM)
└── docs/
Crate Details
readstat (v0.20.0) – Library Crate
Path: crates/readstat/
Pure library for parsing SAS binary files into Arrow RecordBatch format. Output format writers (CSV, Feather, NDJSON, Parquet) are feature-gated.
Features: csv, feather, ndjson, parquet (all enabled by default), sql.
Key source modules in crates/readstat/src/:
| Module | Purpose |
|---|---|
| `lib.rs` | Public API exports |
| `cb.rs` | C callback functions for ReadStat (handle_metadata, handle_variable, handle_value) |
| `rs_data.rs` | Data reading, Arrow RecordBatch conversion |
| `rs_metadata.rs` | Metadata extraction, Arrow schema building |
| `rs_parser.rs` | ReadStatParser wrapper around C parser |
| `rs_path.rs` | Input path validation |
| `rs_write_config.rs` | Output configuration (path, format, compression) |
| `rs_var.rs` | Variable types and value handling |
| `rs_write.rs` | Output writers (CSV, Feather, NDJSON, Parquet) |
| `progress.rs` | ProgressCallback trait for parsing progress reporting |
| `rs_query.rs` | SQL query execution via DataFusion (feature-gated) |
| `formats.rs` | SAS format detection (118 date/time/datetime formats, regex-based) |
| `err.rs` | Error enum (41 variants mapping to C library errors) |
| `common.rs` | Utility functions |
| `rs_buffer_io.rs` | Buffer I/O operations |
Key public types:
- `ReadStatData` – coordinates FFI parsing, accumulates values directly into typed Arrow builders, produces Arrow RecordBatch
- `ReadStatMetadata` – file-level metadata (row/var counts, encoding, compression, schema)
- `ColumnBuilder` – enum wrapping 12 typed Arrow builders (StringBuilder, Float64Builder, Date32Builder, etc.); values are appended during FFI callbacks with zero intermediate allocation
- `ReadStatWriter` – writes output in requested format
- `ReadStatPath` – validated input file path
- `WriteConfig` – output configuration (path, format, compression)
- `OutFormat` – output format enum (Csv, Feather, Ndjson, Parquet)
- `ProgressCallback` – trait for receiving progress updates during parsing
Major dependencies: Arrow v57 ecosystem, Parquet (5 compression codecs, optional), Rayon, chrono, memmap2.
readstat-cli (v0.20.0) – CLI Binary
Path: crates/readstat-cli/
Binary crate producing the readstat CLI tool. Uses clap with three subcommands:
- `metadata` – print file metadata (row/var counts, labels, encoding, etc.)
- `preview` – preview first N rows
- `data` – convert to output format (csv, feather, ndjson, parquet)
Owns CLI arg parsing, progress bars, colored output, and reader-writer thread orchestration.
Additional dependencies: clap v4, colored, indicatif, crossbeam, env_logger, path_abs.
readstat-sys (v0.3.0) – FFI Bindings
Path: crates/readstat-sys/
build.rs compiles ~49 C source files from vendor/ReadStat/ git submodule via the cc crate, then generates Rust bindings with bindgen. Exposes the full ReadStat API including support for SAS, SPSS, and Stata formats. Platform-specific linking for iconv and zlib:
| Platform | iconv | zlib | Notes |
|---|---|---|---|
| Windows (windows-msvc) | Static: compiled from vendored readstat-iconv-sys submodule | Static: compiled via libz-sys crate | readstat-iconv-sys is a cfg(windows) dependency; needs LIBCLANG_PATH |
| macOS (apple-darwin) | Dynamic: system libiconv | libz-sys (uses system zlib) | iconv linked via cargo:rustc-link-lib=iconv |
| Linux (gnu/musl) | Dynamic: system library | libz-sys (prefers system, falls back to source) | No explicit iconv link directives; system linker resolves automatically |
Header include paths are propagated between crates using Cargoβs links key:
- `readstat-iconv-sys` sets `cargo:include=...` which becomes `DEP_ICONV_INCLUDE` in `readstat-sys`
- `libz-sys` sets `cargo:include=...` which becomes `DEP_Z_INCLUDE` in `readstat-sys`
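A hedged sketch of this handoff (illustrative helper functions, not the actual build scripts): the exporting crate prints a `cargo:include=` metadata line, and Cargo re-exposes it to dependent build scripts as an environment variable named after the `links` key.

```rust
use std::path::PathBuf;

// The build script of the crate with `links = "iconv"` prints a line of
// this shape to stdout; Cargo captures it as build-script metadata.
fn export_include(path: &str) -> String {
    format!("cargo:include={path}")
}

// Cargo turns that metadata into DEP_ICONV_INCLUDE for dependents, so a
// dependent build script (e.g. readstat-sys's build.rs) can read it back:
fn dep_iconv_include() -> Option<PathBuf> {
    std::env::var_os("DEP_ICONV_INCLUDE").map(PathBuf::from)
}
```

This is how readstat-sys locates the vendored iconv headers on Windows without hard-coding a path.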
readstat-iconv-sys (v0.3.0) – iconv FFI (Windows)
Path: crates/readstat-iconv-sys/
Windows-only (#[cfg(windows)]). Compiles libiconv from the vendor/libiconv-win-build/ git submodule using the cc crate, producing a static library. On non-Windows platforms the build script is a no-op. The links = "iconv" key in Cargo.toml allows readstat-sys to discover the include path via the DEP_ICONV_INCLUDE environment variable.
readstat-wasm (v0.1.0) – WebAssembly Build
Path: crates/readstat-wasm/
WebAssembly build of the readstat library for parsing SAS .sas7bdat files in JavaScript. Compiles the ReadStat C library and the Rust readstat library to WebAssembly via the wasm32-unknown-emscripten target. Excluded from the Cargo workspace (built separately with Emscripten).
Exports: read_metadata, read_metadata_fast, read_data (CSV), read_data_ndjson, read_data_parquet, read_data_feather, free_string, free_binary. Not published to crates.io (publish = false).
readstat-tests – Integration Tests
Path: crates/readstat-tests/
29 test modules covering: all SAS data types, 118 date/time/datetime formats, missing values, malformed UTF-8, large pages, CLI subcommands, parallel read/write, Parquet output, CSV output, Arrow migration, row offsets, scientific notation, column selection, skip row count, memory-mapped file reading, byte-slice reading, and SQL queries. Every sas7bdat file in the test data directory has both metadata and data reading tests.
Test data lives in tests/data/*.sas7bdat (14 datasets). SAS scripts to regenerate test data are in util/.
| Dataset | Metadata Test | Data Test |
|---|---|---|
| `all_dates.sas7bdat` | ✅ | ✅ |
| `all_datetimes.sas7bdat` | ✅ | ✅ |
| `all_times.sas7bdat` | ✅ | ✅ |
| `all_types.sas7bdat` | ✅ | ✅ |
| `cars.sas7bdat` | ✅ | ✅ |
| `hasmissing.sas7bdat` | ✅ | ✅ |
| `intel.sas7bdat` | ✅ | ✅ |
| `malformed_utf8.sas7bdat` | ✅ | ✅ |
| `messydata.sas7bdat` | ✅ | ✅ |
| `rand_ds_largepage_err.sas7bdat` | ✅ | ✅ |
| `rand_ds_largepage_ok.sas7bdat` | ✅ | ✅ |
| `scientific_notation.sas7bdat` | ✅ | ✅ |
| `somedata.sas7bdat` | ✅ | ✅ |
| `somemiss.sas7bdat` | ✅ | ✅ |
Build Prerequisites
- Rust (edition 2024)
- libclang (for bindgen)
- Git submodules must be initialized (`git submodule update --init --recursive`)
- On Windows: MSVC toolchain
Key Architectural Patterns
- FFI callback pattern: the ReadStat C library calls Rust callbacks (`cb.rs`) during parsing; data accumulates in `ReadStatData` via raw pointer casts
- Streaming: the default reader streams rows in chunks (10k) to manage memory
- Parallel processing: Rayon for parallel reading, Crossbeam channels for reader-writer coordination
- Column filtering: optional `--columns`/`--columns-file` flags restrict parsing to selected variables; unselected values are skipped in the `handle_value` callback while row-boundary detection uses the original (unfiltered) variable count
- Arrow pipeline: SAS data → typed Arrow builders (direct append in FFI callbacks) → Arrow RecordBatch → output format
- Multiple I/O strategies: file path (default), memory-mapped files (`memmap2`), and in-memory byte slices — all feed into the same FFI parsing pipeline
- Metadata preservation: SAS variable labels, format strings, and storage widths are persisted as Arrow field metadata, surviving round-trips through Parquet and Feather. See TECHNICAL.md for details.
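The column-filtering rule above can be illustrated with a small sketch. This is not the crate's actual code — the real logic lives in the `handle_value` FFI callback in `cb.rs`, and `on_value` here is a hypothetical stand-in — but it shows why row boundaries must be detected against the original, unfiltered variable count even when values are skipped:

```rust
// Sketch (hypothetical helper): decide whether to keep a value and whether
// this value completes the current row.
fn on_value(var_index: usize, total_vars: usize, selected: &[bool]) -> (bool, bool) {
    let keep = selected[var_index]; // write this value into the builders?
    let row_complete = var_index + 1 == total_vars; // boundary uses the UNFILTERED count
    (keep, row_complete)
}

fn main() {
    // 3 variables in the file, but only columns 0 and 2 selected
    let selected = [true, false, true];
    assert_eq!(on_value(1, 3, &selected), (false, false)); // skipped, row still open
    assert_eq!(on_value(2, 3, &selected), (true, true));   // kept, and row complete
}
```

If the boundary check used the filtered count instead, a skipped trailing column would cause rows to close early and values to shift between rows.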
Technical Details
Floating Point Values
⚠️ Decimal values are rounded to contain only 14 decimal digits!
For example, the number 1.1234567890123456 created within SAS would be returned as 1.12345678901235 within Rust.
Why does this happen? Is this an implementation error? No: rounding to only 14 decimal digits is intentional in the Rust code.
As a specific example, when testing with the cars.sas7bdat dataset (which was created originally on Windows), the numeric value 4.6 as observed within SAS was being returned as 4.600000000000001 (15 digits) within Rust. Values created on Windows with an x64 processor are only accurate to 15 digits.
For comparison, the ReadStat binary truncates to 14 decimal places when writing to csv.
Finally, SAS represents all numeric values in floating-point representation, which creates a challenge for all parsed numeric values.
Implementation: pure-arithmetic rounding
Rounding is performed using pure f64 arithmetic in cb.rs, avoiding any string formatting or heap allocation:
const ROUND_SCALE: f64 = 1e14;

fn round_decimal_f64(v: f64) -> f64 {
    if !v.is_finite() {
        return v;
    }
    let int_part = v.trunc();
    let frac_part = v.fract();
    let rounded_frac = (frac_part * ROUND_SCALE).round() / ROUND_SCALE;
    int_part + rounded_frac
}
The value is split into integer and fractional parts before scaling. This is necessary because large SAS datetime values (~1.9e9) multiplied directly by 1e14 would exceed f64's exact integer range (2^53), causing precision loss. Since fract() is always in (-1, 1), |fract()| * 1e14 < 1e14 < 2^53, keeping the scaled value within the exact-integer range.
Why this is equivalent to the previous string roundtrip (format!("{:.14}") + lexical::parse): both approaches produce the nearest representable f64 to the value rounded to 14 decimal places. The tie-breaking rule (half-away-from-zero for .round() vs half-to-even for format!) is never exercised because every f64 is a dyadic rational (m / 2^k), and a true decimal midpoint would require an odd factor of 5 in the denominator — which is impossible for any f64 value.
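The equivalence claim above can be checked directly. The sketch below reproduces the arithmetic rounding from this section and compares it against the former string round-trip (using std parsing in place of lexical::parse), on the exact example quoted earlier in this section:

```rust
const ROUND_SCALE: f64 = 1e14;

// Reproduced from the section above
fn round_decimal_f64(v: f64) -> f64 {
    if !v.is_finite() {
        return v;
    }
    v.trunc() + (v.fract() * ROUND_SCALE).round() / ROUND_SCALE
}

fn main() {
    let v = 1.1234567890123456_f64;
    let arithmetic = round_decimal_f64(v);
    // The former approach: format to 14 decimal places, then parse back
    let string_roundtrip: f64 = format!("{v:.14}").parse().unwrap();
    assert_eq!(arithmetic, string_roundtrip);
    // Matches the example above: 1.1234567890123456 -> 1.12345678901235
    assert_eq!(format!("{arithmetic}"), "1.12345678901235");
}
```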
Sources
- How SAS Stores Numeric Values
- Accuracy on x64 Windows Processors
- SAS on Windows with x64 processors can only represent 15 digits
- Floating-point arithmetic may give inaccurate results in Excel
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg, 1991)
Date, Time, and Datetimes
All 118 SAS date, time, and datetime formats are recognized and parsed appropriately. For the full list of supported formats, see sas_date_time_formats.md.
⚠️ If the format does not match a recognized SAS date, time, or datetime format, or if the value does not have a format applied, then the value will be parsed and read as a numeric value!
Details
SAS stores dates, times, and datetimes internally as numeric values. To distinguish among dates, times, datetimes, or numeric values, a SAS format is read from the variable metadata. If the format matches a recognized SAS date, time, or datetime format then the numeric value is converted and read into memory using one of the Arrow types:
- Date32Type
- Time32SecondType
- Time64MicrosecondType — for time formats with microsecond precision (e.g. TIME15.6, decimal places 4–6)
- TimestampSecondType
- TimestampMillisecondType — for datetime formats with millisecond precision (e.g. DATETIME22.3, decimal places 1–3)
- TimestampMicrosecondType — for datetime formats with microsecond precision (e.g. DATETIME22.6, decimal places 4–6)
- TimestampNanosecondType — for datetime formats with nanosecond precision (e.g. DATETIME22.9, decimal places 7–9)
If values are read into memory as Arrow date, time, or datetime types, then when they are written — from an Arrow RecordBatch to csv, feather, ndjson, or parquet — they are treated as dates, times, or datetimes and not as numeric values.
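The numeric-to-temporal conversion implied above can be sketched as follows. SAS stores dates as days and datetimes as seconds since 1960-01-01, while Arrow's Date32 and Timestamp types count from the Unix epoch (1970-01-01) — a gap of 3,653 days, or 315,619,200 seconds. The helper names below are hypothetical; the crate's actual conversion lives in its FFI callbacks:

```rust
// Days (and seconds) from the SAS epoch (1960-01-01) to the Unix epoch (1970-01-01)
const SAS_TO_UNIX_DAYS: i32 = 3_653;
const SAS_TO_UNIX_SECONDS: i64 = 3_653 * 86_400; // = 315_619_200

fn sas_date_to_date32(sas_days: f64) -> i32 {
    sas_days as i32 - SAS_TO_UNIX_DAYS
}

fn sas_datetime_to_timestamp_secs(sas_seconds: f64) -> i64 {
    sas_seconds as i64 - SAS_TO_UNIX_SECONDS
}

// Choosing a timestamp unit from the datetime format's decimal width,
// per the mapping listed above
fn timestamp_unit(decimals: u32) -> &'static str {
    match decimals {
        0 => "second",
        1..=3 => "millisecond",
        4..=6 => "microsecond",
        _ => "nanosecond",
    }
}

fn main() {
    assert_eq!(sas_date_to_date32(3_653.0), 0); // SAS day 3653 is 1970-01-01
    assert_eq!(sas_datetime_to_timestamp_secs(0.0), -315_619_200); // 1960-01-01T00:00:00
    assert_eq!(timestamp_unit(3), "millisecond"); // e.g. DATETIME22.3
}
```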
Column Metadata in Arrow and Parquet
When converting to Parquet or Feather, readstat-rs persists column-level and table-level metadata into the Arrow schema. This metadata survives round-trips through Parquet and Feather files, allowing downstream consumers to recover SAS-specific information.
Metadata keys
Field (column) metadata
| Key | Type | Description | Source formats |
|---|---|---|---|
| label | string | User-assigned variable label | SAS, SPSS, Stata |
| sas_format | string | SAS format string (e.g. DATE9, BEST12, $30) | SAS |
| storage_width | integer (as string) | Number of bytes used to store the variable value | All |
| display_width | integer (as string) | Display width hint from the file | XPORT, SPSS |
Schema (table) metadata
| Key | Type | Description |
|---|---|---|
| table_label | string | User-assigned file label |
Storage width semantics
- SAS numeric variables: always 8 bytes (IEEE 754 double-precision)
- SAS string variables: equal to the declared character length (e.g. `$30` → 30 bytes)
- The `storage_width` field is always present in metadata
Display width semantics
- sas7bdat files: typically 0 (not stored in the format)
- XPORT files: populated from the format width
- SPSS files: populated from the variableβs print/write format
- The `display_width` field is only present in metadata when non-zero
SAS format strings and Arrow types
The SAS format string (e.g. DATE9, DATETIME22.3, TIME8) determines how a numeric variable is mapped to an Arrow type. The original format string is preserved in the sas_format metadata key, allowing downstream tools to reconstruct the original SAS formatting even after conversion.
For the full list of recognized SAS date, time, and datetime formats, see sas_date_time_formats.md.
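The metadata rules above can be sketched as a plain map, using the key names this document defines (`label`, `sas_format`, `storage_width`, `display_width`). The helper itself is hypothetical — the crate attaches these keys to Arrow fields — but the sketch captures the always-present vs. only-when-non-zero distinction:

```rust
use std::collections::HashMap;

// Sketch (hypothetical helper): build the field-metadata map described above.
fn sas_field_metadata(
    label: &str,
    sas_format: &str,
    storage_width: u32,
    display_width: u32,
) -> HashMap<String, String> {
    let mut m = HashMap::new();
    m.insert("label".to_string(), label.to_string());
    m.insert("sas_format".to_string(), sas_format.to_string());
    // storage_width is always present
    m.insert("storage_width".to_string(), storage_width.to_string());
    // display_width is only persisted when non-zero (see above)
    if display_width != 0 {
        m.insert("display_width".to_string(), display_width.to_string());
    }
    m
}

fn main() {
    // A numeric SAS variable: 8 bytes, no display width in a sas7bdat file
    let m = sas_field_metadata("Model year", "DATE9", 8, 0);
    assert_eq!(m["sas_format"], "DATE9");
    assert_eq!(m["storage_width"], "8");
    assert!(!m.contains_key("display_width"));
}
```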
Reading metadata from output files
See the Reading Metadata from Output Files section in the Usage guide for Python and R examples.
Testing
To perform unit / integration tests, run the following.
cargo test --workspace
To run only integration tests:
cargo test -p readstat-tests
Datasets
Formally tested (via integration tests) against the following datasets. See the README.md for data sources.
- `ahs2019n.sas7bdat` — US Census data (download via download_ahs.sh or download_ahs.ps1)
- `all_dates.sas7bdat` — SAS dataset containing all possible date formats
- `all_datetimes.sas7bdat` — SAS dataset containing all possible datetime formats
- `all_times.sas7bdat` — SAS dataset containing all possible time formats
- `all_types.sas7bdat` — SAS dataset containing all SAS types
- `cars.sas7bdat` — SAS cars dataset
- `hasmissing.sas7bdat` — SAS dataset containing missing values
- `intel.sas7bdat`
- `malformed_utf8.sas7bdat` — SAS dataset with truncated multi-byte UTF-8 characters (issue #78)
- `messydata.sas7bdat`
- `rand_ds_largepage_err.sas7bdat` — Created using create_rand_ds.sas with BUFSIZE set to 2M
- `rand_ds_largepage_ok.sas7bdat` — Created using create_rand_ds.sas with BUFSIZE set to 1M
- `scientific_notation.sas7bdat` — Used to test float parsing
- `somedata.sas7bdat` — Used to test Parquet label preservation
- `somemiss.sas7bdat`
Valgrind
To ensure no memory leaks, valgrind may be utilized. For example, to ensure no memory leaks for the test parse_file_metadata_test, run the following from within the readstat directory.
valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
Memory Safety
This project contains unsafe Rust code (FFI callbacks, pointer casts, memory-mapped I/O) and links against the vendored ReadStat C library. Four automated CI checks guard against memory errors.
CI Jobs
All four jobs run on every workflow dispatch and tag push, in parallel with the build jobs. Any memory error fails the job with a nonzero exit code.
Miri (Rust undefined behavior)
- Platform: Ubuntu (Linux)
- Scope: Unit tests in the `readstat` crate only (`cargo miri test -p readstat`)
- What it catches: Undefined behavior in pure-Rust unsafe code — invalid pointer arithmetic, uninitialized reads, provenance violations, use-after-free in Rust allocations
- Limitation: Cannot execute FFI calls into C code, so integration tests (`readstat-tests`) are excluded
Configuration:
- Uses Rust nightly with the `miri` component
- `MIRIFLAGS="-Zmiri-disable-isolation"` allows tests that use `tempfile` to create directories
AddressSanitizer β Linux
- Platform: Ubuntu (Linux)
- Scope: Full workspace — lib tests, integration tests, binary tests (`cargo test --workspace --lib --tests --bins`)
- What it catches: Heap/stack buffer overflows, use-after-free, double-free, memory leaks (LeakSanitizer is enabled by default on Linux), across both Rust and C code
Configuration:
- `RUSTFLAGS="-Zsanitizer=address -Clinker=clang"` — instruments Rust code and links the ASan runtime via clang
- `READSTAT_SANITIZE_ADDRESS=1` — triggers `readstat-sys/build.rs` to compile the ReadStat C library with `-fsanitize=address -fno-omit-frame-pointer`
- Doctests are excluded (`--lib --tests --bins`) because `rustdoc` does not properly inherit sanitizer linker flags
AddressSanitizer β macOS
- Platform: macOS (arm64)
- Scope: Full workspace — lib tests, integration tests, binary tests
- What it catches: Buffer overflows, use-after-free, double-free in Rust code and at the FFI boundary
Configuration:
- `RUSTFLAGS="-Zsanitizer=address"` — instruments Rust code only
- The ReadStat C library is not instrumented on macOS because Apple Clang and Rust's LLVM have incompatible ASan runtimes — see ASan Runtime Mismatch below
- LeakSanitizer is not supported on macOS
- Doctests excluded for the same reason as Linux
AddressSanitizer β Windows
- Platform: Windows (x86_64, MSVC toolchain)
- Scope: Full workspace — lib tests, integration tests, binary tests
- What it catches: Buffer overflows, use-after-free, double-free in Rust code and at the FFI boundary
Configuration:
- `RUSTFLAGS="-Zsanitizer=address"` — instruments Rust code only
- Rust on Windows MSVC uses Microsoft's ASan runtime (from Visual Studio), not LLVM's compiler-rt. The compiler passes `/INFERASANLIBS` to the MSVC linker, which auto-discovers the runtime import library at link time. See PR #118521.
- Important: the MSVC ASan runtime DLL (`clang_rt.asan_dynamic-x86_64.dll`) is NOT on PATH by default. The linker finds the import library at build time via `/INFERASANLIBS`, but the DLL loader needs the DLL on PATH at test runtime. The CI job uses `vswhere.exe` to locate the DLL directory (e.g., `C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\<ver>\bin\Hostx64\x64\`) and prepends it to PATH.
- LLVM is installed only for `libclang` (required by bindgen), pinned to the same version as the regular Windows build job. It is not used for the ASan runtime.
- The ReadStat C library is not currently instrumented on Windows. Unlike macOS, there is no runtime mismatch — both Rust and `cl.exe` use the same MSVC ASan runtime. Full C instrumentation is a future improvement (see Future Work).
- LeakSanitizer is not supported on Windows
- Doctests excluded for the same reason as Linux
How READSTAT_SANITIZE_ADDRESS Works
The readstat-sys/build.rs build script checks for the READSTAT_SANITIZE_ADDRESS environment variable. When set, it adds sanitizer flags to the C compiler flags for the ReadStat library only. This is intentionally scoped — a global CFLAGS would instrument third-party sys crates (e.g., zstd-sys), causing linker failures.
The flags are platform-specific:
- Linux/macOS: `-fsanitize=address -fno-omit-frame-pointer` (GCC/Clang syntax)
- Windows MSVC: `/fsanitize=address` (MSVC syntax)
Currently only the Linux CI job sets READSTAT_SANITIZE_ADDRESS=1 because it is the only platform where C instrumentation has been validated.
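The gate described above can be sketched as a pure function over the two inputs (the real logic lives in `readstat-sys/build.rs` and feeds a C build; `asan_cflags` here is hypothetical):

```rust
// Sketch (hypothetical helper): compute the extra C compiler flags from the
// READSTAT_SANITIZE_ADDRESS env var and the target toolchain.
fn asan_cflags(env_set: bool, target_is_msvc: bool) -> Vec<&'static str> {
    if !env_set {
        return Vec::new(); // unset: the C library builds uninstrumented
    }
    if target_is_msvc {
        vec!["/fsanitize=address"] // MSVC syntax
    } else {
        vec!["-fsanitize=address", "-fno-omit-frame-pointer"] // GCC/Clang syntax
    }
}

fn main() {
    assert!(asan_cflags(false, false).is_empty());
    assert_eq!(asan_cflags(true, true), vec!["/fsanitize=address"]);
    assert_eq!(
        asan_cflags(true, false),
        vec!["-fsanitize=address", "-fno-omit-frame-pointer"]
    );
}
```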
ASan Runtime Mismatch (macOS)
macOS has an ASan runtime mismatch that prevents instrumenting the C code alongside Rust. Apple Clang is a fork of LLVM with its own ASan runtime versioning. When both Rust and the C library are instrumented, the linker sees two incompatible ASan runtimes and fails with ___asan_version_mismatch_check_apple_clang_* vs ___asan_version_mismatch_check_v8. A potential workaround is to install upstream LLVM via Homebrew (brew install llvm) and set CC=/opt/homebrew/opt/llvm/bin/clang so both the C code and Rust use the same LLVM ASan runtime. However, this is fragile — the Homebrew LLVM version must stay close to the LLVM version used by Rust nightly, which changes frequently.
Windows does NOT have this problem. Rust on x86_64-pc-windows-msvc uses Microsoftβs ASan runtime (PR #118521), and so does cl.exe /fsanitize=address. Both link the same clang_rt.asan_dynamic-x86_64.dll from Visual Studio. Full C + Rust ASan instrumentation is theoretically possible on Windows β see Future Work.
Bottom line: Linux has full C + Rust ASan coverage. macOS provides Rust-only coverage due to the Apple Clang runtime mismatch. Windows provides Rust-only coverage currently, but full coverage is a future improvement since there is no runtime mismatch.
Future Work: Windows C Instrumentation
Since Rust and MSVC share the same ASan runtime on Windows, enabling READSTAT_SANITIZE_ADDRESS=1 in the Windows CI job should allow full C + Rust instrumentation, matching Linux's coverage. This requires:
- Setting `READSTAT_SANITIZE_ADDRESS=1` so `readstat-sys/build.rs` adds `/fsanitize=address` when compiling the ReadStat C library
- Verifying there are no linker conflicts (if conflicts arise, the unstable `-Zexternal-clangrt` flag can tell Rust to skip linking its own runtime copy)
- Ensuring the MSVC ASan runtime DLL is on PATH at test time (the CI job already does this via `vswhere.exe`)
Running Locally
Miri
rustup +nightly component add miri
MIRIFLAGS="-Zmiri-disable-isolation" cargo +nightly miri test -p readstat
ASan on Linux
RUSTFLAGS="-Zsanitizer=address -Clinker=clang" \
READSTAT_SANITIZE_ADDRESS=1 \
cargo +nightly test --workspace --lib --tests --bins --target x86_64-unknown-linux-gnu
ASan on macOS
RUSTFLAGS="-Zsanitizer=address" \
cargo +nightly test --workspace --lib --tests --bins --target aarch64-apple-darwin
ASan on Windows
$env:RUSTFLAGS = "-Zsanitizer=address"
# The MSVC ASAN runtime DLL must be on PATH. Find it via vswhere:
$vsPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -latest -property installationPath
$msvcVer = (Get-ChildItem "$vsPath\VC\Tools\MSVC" | Sort-Object Name -Descending | Select-Object -First 1).Name
$env:PATH = "$vsPath\VC\Tools\MSVC\$msvcVer\bin\Hostx64\x64;$env:PATH"
cargo +nightly test --workspace --lib --tests --bins --target x86_64-pc-windows-msvc
Valgrind (Linux)
For manual checks with full C library coverage, valgrind can also be used against debug test binaries:
cargo test -p readstat-tests --no-run
valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
Coverage Summary
| Tool | Platform | Rust code | C code (ReadStat) | Leak detection |
|---|---|---|---|---|
| Miri | Linux | Unit tests only | No (FFI excluded) | No |
| ASan | Linux | Full workspace | Yes (instrumented) | Yes |
| ASan | macOS | Full workspace | No (runtime mismatch) | No |
| ASan | Windows | Full workspace | Not yet (no mismatch — see future work) | No |
| Valgrind | Linux (manual) | Full | Full | Yes |
Performance Benchmarking with Criterion
Overview
This document provides a comprehensive guide to performance benchmarking in readstat-rs using Criterion.rs.
Quick Start
# Run all benchmarks
cd crates/readstat
cargo bench
# View HTML reports
open target/criterion/report/index.html
What Gets Benchmarked
1. Reading Performance
- Metadata Reading (~300-950 µs) - File header parsing
- Single Chunk Reading - Full dataset read performance
- Chunked Reading - Streaming with different chunk sizes (1K, 5K, 10K rows)
2. Data Conversion
- Arrow Conversion - SAS types → Arrow RecordBatch overhead
3. Writing Performance
- CSV Writing - Text format output
- Parquet Compression - Uncompressed, Snappy, Zstd comparison
- Format Comparison - CSV vs Parquet vs Feather vs NDJSON
4. Parallel Write Optimization
- Buffer Sizes - SpooledTempFile memory thresholds (1MB, 10MB, 100MB, 500MB)
5. End-to-End Pipeline
- Complete Conversion - Read + Write combined (most important)
Sample Results
From initial benchmark run (example output):
metadata_reading/all_types.sas7bdat
time: [299.41 µs 301.84 µs 304.29 µs]
metadata_reading/cars.sas7bdat
time: [935.21 µs 943.52 µs 952.41 µs]
read_single_chunk/cars.sas7bdat
time: [~2-3 ms]
thrpt: [~150-200K rows/sec]
write_parquet_compression/snappy
time: [~4-6 ms]
thrpt: [~70-100K rows/sec]
end_to_end_conversion/parquet
time: [~6-9 ms]
thrpt: [~50-70K rows/sec]
Interpreting Results
Understanding the Output
Time Measurement:
time: [299.41 µs 301.84 µs 304.29 µs]
^ ^ ^
| | +-- Upper bound (95% confidence)
| +------------ Median
+---------------------- Lower bound (95% confidence)
Throughput:
thrpt: [150K elem/s 175K elem/s 200K elem/s]
^ ^ ^
| | +-- Upper bound
| +-------------- Median
+-------------------------- Lower bound
Change Detection:
change: [-2.3456% -1.2345% +0.1234%] (p = 0.12 > 0.05)
^ ^ ^ ^
| | | +-- Statistical significance
| | +----------- Upper bound of change
| +--------------------- Median change
+------------------------------- Lower bound of change
What to Look For
Red Flags (Investigate)
- High variance (>10%) - Results unreliable
- Significant regression (>5% slower, p < 0.05)
- Outliers (>5% of samples)
Opportunities
- Chunked reading - Test if different chunk size improves throughput
- Buffer sizes - If small buffer performs as well as large, save memory
- Compression - If uncompressed only slightly faster, use compression
Validation
- Low variance (<5%) - Reliable results
- Improvements (>10% faster, p < 0.05)
- Expected patterns (e.g., compression should be slower but smaller)
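The decision thresholds above can be written down as simple predicates. These are taken from this document's rules of thumb, not from Criterion's API:

```rust
// Regression: significantly slower (median change > +5%, p < 0.05)
fn is_regression(median_change_pct: f64, p: f64) -> bool {
    median_change_pct > 5.0 && p < 0.05
}

// Improvement: significantly faster (median change < -10%, p < 0.05)
fn is_improvement(median_change_pct: f64, p: f64) -> bool {
    median_change_pct < -10.0 && p < 0.05
}

fn main() {
    // The sample change output above: -1.2345% median with p = 0.12,
    // which is neither a regression nor an improvement
    assert!(!is_regression(-1.2345, 0.12));
    assert!(!is_improvement(-1.2345, 0.12));
    assert!(is_regression(7.5, 0.01));
}
```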
Performance Optimization Workflow
Step 1: Establish Baseline
# Save current performance as baseline
cargo bench --save-baseline main
# Results saved to target/criterion/{benchmark}/main/
Step 2: Make Changes
Edit code with optimization hypothesis:
- Increase buffer size
- Change algorithm
- Add caching
- Parallel processing
Step 3: Measure Impact
# Compare against baseline
cargo bench --baseline main
# Look for "change: [X% Y% Z%]" in output
Step 4: Analyze & Iterate
- If improved (>10%, p < 0.05): keep the change, then update the baseline (cargo bench --save-baseline main)
- If no change (<5%): the optimization didn't help — profile to find the real bottleneck
- If regressed (slower): revert the change and investigate why performance decreased
Common Optimization Scenarios
Scenario 1: Slow Reading
Symptoms: read_single_chunk time is high
Investigate:
- ReadStat C library overhead (FFI calls)
- Memory allocation patterns
- Callback overhead
Try:
- Larger buffers in C library
- Memory-mapped files (see evaluation doc)
- Pre-allocate column vectors
Scenario 2: Slow Writing
Symptoms: write_formats time is high
Investigate:
- BufWriter buffer size
- Format-specific overhead
- Compression CPU usage
Try:
- Increase BufWriter capacity (currently 8KB)
- Use faster compression (Snappy vs Zstd)
- Parallel writing (already implemented)
Scenario 3: Memory Issues
Symptoms: System swapping, OOM errors
Investigate:
- Chunk size too large
- Too many parallel streams
- Memory leaks
Try:
- Reduce `stream_rows` (default 10,000)
- Reduce the parallel write buffer (default 100MB)
- Use bounded channels (already implemented)
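A back-of-envelope sketch helps when tuning the knobs above. SAS numerics are 8-byte doubles (see the storage-width section), so a chunk's numeric columns alone occupy roughly rows × numeric_vars × 8 bytes; string columns add their declared widths on top. The helper is illustrative only:

```rust
// Approximate lower bound on bytes held by one chunk of numeric columns
fn chunk_bytes(rows: usize, numeric_vars: usize) -> usize {
    rows * numeric_vars * 8 // 8 bytes per IEEE 754 double
}

fn main() {
    // Default 10,000-row chunks with 50 numeric variables: about 4 MB
    assert_eq!(chunk_bytes(10_000, 50), 4_000_000);
    // Halving stream_rows halves the estimate
    assert_eq!(chunk_bytes(5_000, 50), 2_000_000);
}
```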
Scenario 4: High Variance
Symptoms: Large confidence intervals, many outliers
Investigate:
- System background activity
- CPU frequency scaling
- Thermal throttling
Try:
- Close background apps
- Disable frequency scaling
- Run on consistent power mode
Advanced Profiling
CPU Profiling with Flamegraphs
# Install flamegraph
cargo install flamegraph
# Profile a specific benchmark
cargo flamegraph --bench readstat_benchmarks -- --bench read_single_chunk
# Open flamegraph.svg to see hotspots
What to look for:
- Wide bars = lots of time spent
- Deep stacks = call overhead
- Unexpected functions = bugs/inefficiency
Memory Profiling
# Using valgrind (Linux)
valgrind --tool=massif \
cargo bench read_single_chunk --no-run
ms_print massif.out.* > memory_profile.txt
# Using heaptrack (Linux)
heaptrack cargo bench read_single_chunk
heaptrack_gui heaptrack.*.gz
System Call Tracing
# Linux: strace
strace -c cargo bench read_single_chunk 2>&1 | tail -20
# macOS: dtruss
sudo dtruss -c cargo bench read_single_chunk
Comparing Implementations
Before/After Memory-Mapped Files
# Baseline without mmap
git checkout main
cargo bench --save-baseline without-mmap
# With mmap implementation
git checkout feature/mmap
cargo bench --baseline without-mmap
# Look for improvements in read_single_chunk
Parallel vs Sequential
# Test with different parallelism settings
cargo bench end_to_end -- --parallel
cargo bench end_to_end -- --sequential
CI/CD Integration
Performance Regression Detection
Add to .github/workflows/benchmarks.yml:
name: Performance Benchmarks
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Rust
uses: dtolnay/rust-toolchain@stable
- name: Run benchmarks
run: |
cd crates/readstat
cargo bench --no-run # Just compile for CI
- name: Compare with baseline (on main branch)
if: github.event_name == 'pull_request'
run: |
git fetch origin main:main
git checkout main
cargo bench --save-baseline main
git checkout -
cargo bench --baseline main
Best Practices
Do's
- Run benchmarks on consistent hardware
- Close background applications
- Use `--save-baseline` for comparisons
- Profile after benchmarking to find bottlenecks
- Document performance changes in PRs
- Test on representative data sizes
Don'ts
- Don't benchmark on a laptop (throttling)
- Don't optimize without profiling first
- Don't trust results with high variance
- Don't compare across different systems
- Don't commit benchmark artifacts
- Don't skip statistical significance checks
Performance Goals
Current Performance (Baseline)
- Metadata reading: ~300-950 µs
- Read throughput: ~150-200K rows/sec
- Write throughput: ~70-100K rows/sec
- End-to-end: ~50-70K rows/sec
Target Performance (Goals)
- Metadata reading: <500 µs (↓30%)
- Read throughput: >250K rows/sec (↑25%)
- Write throughput: >100K rows/sec (↑30%)
- End-to-end: >100K rows/sec (↑40%)
Stretch Goals
- Memory-mapped reads: 2x faster for large files
- Parallel writes: 3-4x speedup with 4+ cores
- Compression: <10% overhead for Snappy
Data Files for Benchmarking
Current Test Data
- all_types.sas7bdat - 3 rows, 10 vars (tiny)
- cars.sas7bdat - 1081 rows, 13 vars (small)
Recommended Additional Data
For comprehensive benchmarking, consider adding:
Small (good for quick iteration):
- < 1 MB file size
- < 1,000 rows
- 5-10 variables
Medium (typical use case):
- 10-100 MB file size
- 10,000-100,000 rows
- 10-50 variables
Large (stress test):
- Over 1 GB file size
- Over 1,000,000 rows
- 50+ variables
Resources
Documentation
Tools
- cargo-flamegraph
- cargo-benchcmp
- hyperfine - CLI benchmarking (see below)
Blog Posts
Next Steps
- Run full benchmark suite: `cargo bench`
- Review HTML reports: Open `target/criterion/report/index.html`
- Identify bottlenecks: Look for slowest operations
- Profile with flamegraph: Focus on hotspots
- Implement optimizations: Test one at a time
- Validate improvements: Compare against baseline
- Document findings: Update this file with results
Questions?
- See detailed README: `crates/readstat/benches/README.md`
- Check Criterion docs: https://bheisler.github.io/criterion.rs/book/
- Review performance evaluation: Memory-mapped files analysis (separate doc)
Benchmarking with hyperfine
Benchmarking performed with hyperfine.
This example compares the performance of the Rust binary with the performance of the C binary built from the ReadStat repository. In general, the hope is that performance stays fairly close to that of the C binary.
To run, execute the following from within the readstat directory.
# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"
Note: First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the --prepare option provided by hyperfine on Windows.
# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"
Additional benchmarking may be performed in the future now that channels and threads have been implemented.
Profiling with Flamegraphs
Profiling performed with cargo flamegraph.
To run, execute the following from within the readstat directory.
cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv
Flamegraph is written to readstat/flamegraph.svg.
Note: Flamegraphs have not yet been utilized to improve performance.
GitHub Actions
The CI/CD workflow can be triggered in multiple ways:
1. Tag Push (Release)
Push a tag to trigger a full release build with GitHub Release artifacts:
# add and commit local changes
git add .
git commit -m "commit msg"
# push local changes to remote
git push
# add local tag
git tag -a v0.1.0 -m "v0.1.0"
# push local tag to remote
git push origin --tags
To delete and recreate tags:
# delete local tag
git tag --delete v0.1.0
# delete remote tag
git push origin --delete v0.1.0
2. Manual Trigger (GitHub UI)
Trigger a build manually from the GitHub Actions web interface (build-only, no releases):
- Go to the Actions tab
- Select the readstat-rs workflow
- Click Run workflow
- Optionally specify:
  - Version string: Label for artifacts (default: `dev`)
Note: Manual triggers only build artifacts and do not create GitHub releases. To create a release, use a tag push.
3. API Trigger (External Tools)
Trigger builds programmatically using the GitHub API. This is useful for automation tools like Claude Code.
Using gh CLI
# Trigger a build
gh api repos/curtisalexander/readstat-rs/dispatches \
-f event_type=build
# Trigger a build with custom version label
gh api repos/curtisalexander/readstat-rs/dispatches \
-f event_type=build \
-F client_payload='{"version":"test-build-123"}'
Using curl
curl -X POST \
-H "Authorization: token $GITHUB_TOKEN" \
-H "Accept: application/vnd.github.v3+json" \
https://api.github.com/repos/curtisalexander/readstat-rs/dispatches \
-d '{"event_type": "build", "client_payload": {"version": "dev"}}'
4. Claude Code Integration
To have Claude Code trigger a CI build, use this prompt:
Trigger a CI build for readstat-rs by running:
gh api repos/curtisalexander/readstat-rs/dispatches -f event_type=build
Event Types
Repository dispatch event types for API triggers:
| Event Type | Description |
|---|---|
| build | Build all targets and upload artifacts |
| test | Same as build (alias for clarity) |
| release | Same as build (reserved for future use) |
Note: API triggers only build artifacts and do not create GitHub releases. To create a release, use a tag push.
Artifacts
All builds (regardless of trigger method) upload artifacts that can be downloaded from the workflow run page. Artifacts are retained for the default GitHub Actions retention period.
Releasing to crates.io
Step-by-step guide for publishing readstat-rs crates to crates.io.
Quick Reference
# Run all pre-publish checks
./scripts/release-check.sh # Linux/macOS
.\scripts\release-check.ps1 # Windows
# Switch vendor dirs from submodules to copied files
./scripts/vendor.sh prepare # Linux/macOS
.\scripts\vendor.ps1 prepare # Windows
# Publish (in dependency order)
cargo publish -p readstat-iconv-sys
cargo publish -p readstat-sys
cargo publish -p readstat
cargo publish -p readstat-cli
# Restore submodules after publishing
./scripts/vendor.sh restore # Linux/macOS
.\scripts\vendor.ps1 restore # Windows
Pre-Release Checklist
1. Version Bumps
Update version numbers in these files (keep them in sync):
| File | Fields |
|---|---|
| crates/readstat/Cargo.toml | version, readstat-sys dependency version |
| crates/readstat-cli/Cargo.toml | version, readstat dependency version |
| crates/readstat-sys/Cargo.toml | version, readstat-iconv-sys dependency version |
| crates/readstat-iconv-sys/Cargo.toml | version |
Version conventions:
- `readstat` and `readstat-cli` share the same version (e.g. 0.20.0)
- `readstat-sys` and `readstat-iconv-sys` share the same version (e.g. 0.3.0)
- Bump sys crate versions only when the vendored C library or bindings change
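The pairing conventions above can be expressed as a check, similar in spirit to what the release-check scripts verify (the function here is hypothetical, not part of the scripts):

```rust
// Sketch: the two version-pairing rules above.
fn versions_consistent(readstat: &str, cli: &str, sys: &str, iconv_sys: &str) -> bool {
    readstat == cli && sys == iconv_sys
}

fn main() {
    assert!(versions_consistent("0.20.0", "0.20.0", "0.3.0", "0.3.0"));
    // readstat and readstat-cli drifting apart should fail the check
    assert!(!versions_consistent("0.20.0", "0.19.0", "0.3.0", "0.3.0"));
}
```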
2. Update CHANGELOG.md
Add an entry for the new version:
## [0.20.0] - 2026-XX-XX
### Added
- ...
### Changed
- ...
### Fixed
- ...
3. Run Automated Checks
./scripts/release-check.sh
This runs:
- `cargo fmt --all -- --check` — formatting
- `cargo clippy --workspace` — linting
- `readstat-wasm` fmt and clippy (excluded from workspace, checked separately)
- `cargo test --workspace` — all tests
- `cargo doc --workspace --no-deps` — documentation build
- `cargo deny check` — license and security audit (if installed)
- Version consistency checks
- CHANGELOG entry check
- `cargo package` dry-run for each publishable crate
Fix any failures before proceeding.
4. Manual Checks
- README.md is up to date
- Documentation reflects any API changes
- Architecture docs (`docs/ARCHITECTURE.md`) are current
- mdbook builds cleanly: `./scripts/build-book.sh`
- `readstat-wasm` builds and exports are up to date (excluded from workspace; not published to crates.io)
Vendor Preparation
The readstat-sys and readstat-iconv-sys crates vendor C source code from git
submodules. cargo publish cannot include git submodule contents, so the files
must be copied as regular files before publishing.
Switch to publish mode
./scripts/vendor.sh prepare # Linux/macOS
.\scripts\vendor.ps1 prepare # Windows
This:
- Records submodule commit hashes in `vendor-lock.txt`
- Copies only the files needed for building (matching `Cargo.toml` `include` patterns)
- Deinitializes the git submodules
- Places the copied files in the vendor directories
Verify package contents
cargo package --list -p readstat-sys --allow-dirty
cargo package --list -p readstat-iconv-sys --allow-dirty
Publishing
Crates must be published in dependency order. Wait for each crate to appear on the crates.io index before publishing the next one.
# 1. No crate dependencies
cargo publish -p readstat-iconv-sys
# 2. Depends on readstat-iconv-sys (Windows only)
cargo publish -p readstat-sys
# 3. Depends on readstat-sys
cargo publish -p readstat
# 4. Depends on readstat
cargo publish -p readstat-cli
Note: There may be a delay (30 seconds to a few minutes) between publishing
a crate and it appearing in the index. If cargo publish fails with a dependency
resolution error, wait and retry.
Post-Publish
1. Restore submodules
./scripts/vendor.sh restore # Linux/macOS
.\scripts\vendor.ps1 restore # Windows
2. Create a git tag
git tag v0.20.0
git push origin v0.20.0
3. Create a GitHub release
Use the GitHub CLI or web UI to create a release from the tag. The CI pipeline
(main.yml) will automatically build platform binaries and attach them.
4. Clean up
- Remove `vendor-lock.txt` (or commit it for reference)
- Verify the published crates on crates.io
- Verify the docs on docs.rs
Troubleshooting
cargo publish fails with "no matching package found"
The dependency crate hasn't appeared in the index yet. Wait 30-60 seconds and retry.
cargo package includes too many files
Check the include field in the crate's Cargo.toml. Run cargo package --list
to see exactly what will be included.
Vendor files missing after vendor.sh restore
Run git submodule update --init --recursive to re-initialize.
Build fails after switching vendor modes
Clean the build cache: cargo clean then rebuild.
readstat
Pure Rust library for parsing SAS binary files (.sas7bdat) into Apache Arrow RecordBatch format. Uses FFI bindings to the ReadStat C library for parsing.
Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The
`readstat-sys` crate exposes the full ReadStat API – all 125 functions across all formats. However, this crate currently only implements parsing and conversion for SAS `.sas7bdat` files. SPSS and Stata formats are not supported.
Features
Output format writers are feature-gated (all enabled by default):
- `csv` – CSV output via `arrow-csv`
- `parquet` – Parquet output (Snappy, Zstd, Brotli, Gzip, Lz4 compression)
- `feather` – Arrow IPC / Feather format
- `ndjson` – Newline-delimited JSON
- `sql` – DataFusion SQL query support (optional, not enabled by default)
Key Types
- `ReadStatData` – Coordinates FFI parsing, accumulates values directly into typed Arrow builders
- `ReadStatMetadata` – File-level metadata (row/var counts, encoding, compression, schema)
- `ReadStatWriter` – Writes Arrow batches to the requested output format
- `ReadStatPath` – Validated input file path
- `WriteConfig` – Output configuration (path, format, compression)
For the full architecture overview, see docs/ARCHITECTURE.md.
readstat-cli
Binary crate producing the readstat CLI tool for converting SAS binary files (.sas7bdat) to other formats.
Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The
`readstat-sys` crate exposes the full ReadStat API – all 125 functions across all formats. However, this CLI currently only supports SAS `.sas7bdat` files. SPSS and Stata formats are not supported.
Subcommands
- `metadata` – Print file metadata (row/var counts, labels, encoding, format version, etc.)
- `preview` – Preview first N rows as CSV to stdout
- `data` – Convert to output format (csv, feather, ndjson, parquet)
Key Features
- Column selection (`--columns`, `--columns-file`)
- Streaming reads with configurable chunk size (`--stream-rows`)
- Parallel reading (`--parallel`) and parallel Parquet writing (`--parallel-write`)
- SQL queries via DataFusion (`--sql`, feature-gated)
- Parquet compression settings (`--compression`, `--compression-level`)
For the full CLI reference, see docs/USAGE.md.
readstat-sys
Raw FFI bindings to the ReadStat C library, generated with bindgen.
The build.rs script compiles ~49 C source files from the vendored vendor/ReadStat/ git submodule via the cc crate and generates Rust bindings with bindgen. Platform-specific linking for iconv and zlib is handled automatically (see docs/BUILDING.md for details).
These bindings expose the full ReadStat API – all 125 functions and all 8 enum types – including support for SAS (.sas7bdat, .xpt), SPSS (.sav, .zsav, .por), and Stata (.dta) file formats. If you need to work with SPSS or Stata files from Rust, this crate provides the complete FFI surface to do so.
This is a sys crate β it exposes raw C types and functions. The higher-level readstat library crate provides a safe API but currently only implements support for SAS .sas7bdat files.
API Coverage
All 125 public C functions and all 8 enum types from readstat.h are bound. All 49 library source files are compiled.
Functions by Category
| Category | Count | Formats |
|---|---|---|
| Metadata accessors | 15 | All |
| Value accessors | 14 | All |
| Variable accessors | 14 | All |
| Parser lifecycle | 3 | All |
| Parser callbacks | 7 | All |
| Parser I/O handlers | 6 | All |
| Parser config | 4 | All |
| File parsers (readers) | 10 | SAS (sas7bdat, sas7bcat, xport), SPSS (sav, por), Stata (dta), text schema (sas_commands, spss_commands, stata_dictionary, txt) |
| Schema parsing | 1 | All |
| Writer lifecycle | 3 | All |
| Writer label sets | 5 | All |
| Writer variable definition | 11 | All |
| Writer notes/strings | 3 | All |
| Writer metadata setters | 8 | All |
| Writer begin | 6 | SAS (sas7bdat, sas7bcat, xport), SPSS (sav, por), Stata (dta) |
| Writer validation | 2 | All |
| Writer row insertion | 12 | All |
| Error handling | 1 | All |
| **Total** | **125** | |
Compiled Source Files
| Directory | Files | Description |
|---|---|---|
| `src/` (core) | 11 | Hash table, parser, value/variable handling, writer, I/O, error |
| `src/sas/` | 11 | SAS7BDAT, SAS7BCAT, XPORT read/write, IEEE float, RLE compression |
| `src/spss/` | 16 | SAV, POR, ZSAV read/write, compression, SPSS parsing |
| `src/stata/` | 4 | DTA read/write, timestamp parsing |
| `src/txt/` | 7 | SAS commands, SPSS commands, Stata dictionary, plain text, schema |
| **Total** | **49** | |
Enum Types
| C Enum | Rust Type Alias | Description |
|---|---|---|
| `readstat_type_e` | `readstat_type_e` | Data types (string, int8/16/32, float, double, string_ref) |
| `readstat_type_class_e` | `readstat_type_class_e` | Type classes (string, numeric) |
| `readstat_measure_e` | `readstat_measure_e` | Measurement levels (nominal, ordinal, scale) |
| `readstat_alignment_e` | `readstat_alignment_e` | Column alignment (left, center, right) |
| `readstat_compress_e` | `readstat_compress_e` | Compression types (none, rows, binary) |
| `readstat_endian_e` | `readstat_endian_e` | Byte order (big, little) |
| `readstat_error_e` | `readstat_error_e` | Error codes (41 variants) |
| `readstat_io_flags_e` | `readstat_io_flags_e` | I/O flags |
Verifying Bindings
To confirm that the Rust bindings stay in sync with the vendored C header and source files, run the verification script:
# Bash (Linux, macOS, Windows Git Bash)
bash crates/readstat-sys/verify_bindings.sh
# Rebuild first, then verify
bash crates/readstat-sys/verify_bindings.sh --rebuild
# PowerShell (Windows)
.\crates\readstat-sys\verify_bindings.ps1
# Rebuild first, then verify
.\crates\readstat-sys\verify_bindings.ps1 -Rebuild
The script checks three things:
- Every function declared in `readstat.h` has a `pub fn` binding in the generated `bindings.rs`
- Every `typedef enum` in the header has a corresponding Rust type alias
- Every `.c` library source file in the vendor directory is listed in `build.rs`
Run this after updating the ReadStat submodule to catch any new or removed API surface.
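The first of those checks boils down to comparing the set of declared C functions against the set of generated bindings. The following is a simplified, hypothetical sketch of the idea using toy stand-in text (the real script handles the full header grammar):

```python
import re

# Toy stand-ins for readstat.h and the generated bindings.rs.
header = """
readstat_error_t readstat_parse_sas7bdat(readstat_parser_t *parser, const char *path, void *ctx);
int readstat_get_row_count(readstat_metadata_t *metadata);
"""
bindings = """
pub fn readstat_parse_sas7bdat(...);
pub fn readstat_get_row_count(...);
"""

# Functions declared in the header: a readstat_* identifier followed by '('.
declared = set(re.findall(r"\b(readstat_\w+)\s*\(", header))
# Functions bound in bindings.rs.
bound = set(re.findall(r"pub fn (\w+)", bindings))

missing = declared - bound
print(sorted(missing))  # [] when every declaration has a binding
```

The enum and source-file checks follow the same shape: extract names from one file, extract names from the other, and report the set difference.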
readstat-iconv-sys
Windows-only FFI bindings to libiconv for character encoding conversion.
The build.rs script compiles libiconv from the vendored vendor/libiconv-win-build/ git submodule using the cc crate. On non-Windows platforms the build script is a no-op.
The links = "iconv" key in Cargo.toml allows readstat-sys to discover the include path via the DEP_ICONV_INCLUDE environment variable.
readstat-tests
Integration test suite for the readstat library and readstat-cli binary.
Contains 29 test modules covering all SAS data types, 118 date/time/datetime formats, missing values, large pages, CLI subcommands, parallel read/write, Parquet output, CSV output, Arrow migration, row offsets, scientific notation, column selection, skip row count, memory-mapped file reading, byte-slice reading, and SQL queries.
Test data lives in tests/data/*.sas7bdat (14 datasets). SAS scripts to regenerate test data are in util/.
Run with:
cargo test -p readstat-tests
readstat-wasm
WebAssembly build of the readstat library for parsing SAS .sas7bdat files in JavaScript. Reads metadata and converts row data to CSV, NDJSON, Parquet, or Feather (Arrow IPC) entirely in memory – no server or native dependencies required at runtime.
Package contents
The pkg/ directory contains everything needed to use the library from JavaScript:
| File | Description |
|---|---|
| `readstat_wasm.wasm` | Pre-built WASM binary (Emscripten target) |
| `readstat_wasm.js` | JS wrapper handling module loading, memory management, and type conversion |
JS API
All functions accept a Uint8Array of raw .sas7bdat file bytes.
import { init, read_metadata, read_metadata_fast, read_data, read_data_ndjson, read_data_parquet, read_data_feather } from "readstat-wasm";
// Must be called once before using any other function
await init();
const bytes = new Uint8Array(/* .sas7bdat file contents */);
// Metadata (returns JSON string)
const metadataJson = read_metadata(bytes);
const metadataJsonFast = read_metadata_fast(bytes); // skips full row count
// Data as text (returns string)
const csv = read_data(bytes); // CSV with header row
const ndjson = read_data_ndjson(bytes); // newline-delimited JSON
// Data as binary (returns Uint8Array)
const parquet = read_data_parquet(bytes); // Parquet bytes
const feather = read_data_feather(bytes); // Feather (Arrow IPC) bytes
Functions
| Function | Returns | Description |
|---|---|---|
| `init()` | `Promise<void>` | Load and initialize the WASM module |
| `read_metadata(bytes)` | `string` | File and variable metadata as JSON |
| `read_metadata_fast(bytes)` | `string` | Same as above but skips full row count for speed |
| `read_data(bytes)` | `string` | All row data as CSV (with header) |
| `read_data_ndjson(bytes)` | `string` | All row data as newline-delimited JSON |
| `read_data_parquet(bytes)` | `Uint8Array` | All row data as Parquet bytes |
| `read_data_feather(bytes)` | `Uint8Array` | All row data as Feather (Arrow IPC) bytes |
How it works
The crate compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly using the wasm32-unknown-emscripten target. Emscripten is required because the underlying C code needs a C standard library (libc, iconv).
The data functions perform a two-pass parse over the byte buffer: first to extract metadata (schema, row count), then to read row values into an Arrow RecordBatch, which is serialized to CSV, NDJSON, Parquet, or Feather in memory.
C ABI exports
The WASM module exposes these C-compatible functions (used internally by the JS wrapper):
| Export | Signature | Purpose |
|---|---|---|
| `read_metadata` | `(ptr, len) -> *char` | Parse metadata as JSON |
| `read_metadata_fast` | `(ptr, len) -> *char` | Same, skipping full row count |
| `read_data` | `(ptr, len) -> *char` | Parse data, return as CSV |
| `read_data_ndjson` | `(ptr, len) -> *char` | Parse data, return as NDJSON |
| `read_data_parquet` | `(ptr, len, out_len) -> *u8` | Parse data, return as Parquet bytes |
| `read_data_feather` | `(ptr, len, out_len) -> *u8` | Parse data, return as Feather bytes |
| `free_string` | `(ptr)` | Free a string returned by the above |
| `free_binary` | `(ptr, len)` | Free a binary buffer returned by parquet/feather |
Building from source
Requires Rust, Emscripten SDK, and libclang.
# Activate Emscripten
source /path/to/emsdk/emsdk_env.sh
# Add the target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only, from repo root)
git submodule update --init --recursive
# Build
cargo build --target wasm32-unknown-emscripten --release
# Copy binary to pkg/
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
See the bun-demo for a working example.
readstat CLI Demo
Demonstrates converting a SAS .sas7bdat file to CSV, NDJSON, Parquet, and Feather using the readstat command-line tool.
Quick start
Linux / macOS
# Build the CLI (from repo root)
cargo build -p readstat-cli
# Run the conversion script
cd examples/cli-demo
bash convert.sh
# Verify the output files
uv run verify_output.py
You can also pass a specific path to the readstat binary:
bash convert.sh /path/to/readstat
Windows (PowerShell)
# Build the CLI (from repo root)
cargo build -p readstat-cli
# Run the conversion script
cd examples/cli-demo
./convert.ps1
# Verify the output files
uv run verify_output.py
You can also pass a specific path to the readstat binary:
./convert.ps1 -ReadStat C:\path\to\readstat.exe
What it does
The convert.sh (Bash) and convert.ps1 (PowerShell) scripts:
- Display metadata for the `cars.sas7bdat` dataset (table name, encoding, row count, variable info)
- Preview the first 5 rows of data
- Convert the dataset to four output formats:
  - `cars.csv` – comma-separated values
  - `cars.ndjson` – newline-delimited JSON
  - `cars.parquet` – Apache Parquet (columnar binary)
  - `cars.feather` – Arrow IPC / Feather (columnar binary)
The verify_output.py script validates all output files:
- Checks row and column counts match the expected 1,081 rows x 13 columns
- Verifies column names are correct
- Confirms cross-format consistency (all four formats contain identical data)
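A cross-format consistency check amounts to parsing each output into a common row representation and comparing. A minimal illustration of the idea, using only the stdlib `csv` and `json` modules on a one-row sample of the cars dataset (not the actual verify_output.py code):

```python
import csv
import io
import json

# One row of the cars dataset in two of the output formats.
csv_text = "Brand,Model,CityMPG\nTOYOTA,Prius,60.0\n"
ndjson_text = '{"Brand":"TOYOTA","Model":"Prius","CityMPG":60.0}\n'


def norm(value):
    """Coerce numeric strings to float so CSV and JSON rows compare equal."""
    try:
        return float(value)
    except ValueError:
        return value


csv_rows = [
    {k: norm(v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
ndjson_rows = [json.loads(line) for line in ndjson_text.splitlines()]

assert csv_rows == ndjson_rows  # the two formats agree on the sample row
```

The real script performs the same comparison across all four formats and all 1,081 rows.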
The cars dataset
| Property | Value |
|---|---|
| Rows | 1,081 |
| Columns | 13 |
| Source | crates/readstat-tests/tests/data/cars.sas7bdat |
| Encoding | WINDOWS-1252 |
Columns: Brand, Model, Minivan, Wagon, Pickup, Automatic, EngineSize, Cylinders, CityMPG, HwyMPG, SUV, AWD, Hybrid
Expected output
Using readstat: /path/to/readstat
Input file: /path/to/cars.sas7bdat
=== Metadata ===
...
=== Preview (first 5 rows) ===
...
Converting to CSV...
-> cars.csv
Converting to NDJSON...
-> cars.ndjson
Converting to Parquet...
-> cars.parquet
Converting to Feather...
-> cars.feather
Done! All output files written to /path/to/examples/cli-demo
Run 'uv run verify_output.py' to validate the output files.
API Server Demo
Two identical API servers demonstrating how to integrate readstat into backend applications:
- Rust server (Axum) – direct library integration
- Python server (FastAPI) – cross-language integration via PyO3/maturin bindings
Both servers expose the same endpoints and return identical results for the same input.
Prerequisites
Rust server:
- Rust toolchain
- `libclang` (for readstat-sys bindgen)
- Git submodules initialized: `git submodule update --init --recursive`
Python server:
- Everything above, plus:
- uv (Python package manager)
- Python 3.9+
Quick Start
Rust Server (port 3000)
cd examples/api-demo/rust-server
cargo run
You should see:
Rust API server listening on http://localhost:3000
Python Server (port 3001)
cd examples/api-demo/python-server
# Build the PyO3 bindings into the project venv
uv sync
uv run maturin develop -m readstat_py/Cargo.toml
# Start the server
uv run uvicorn server:app --port 3001
You should see:
INFO: Started server process [...]
INFO: Uvicorn running on http://127.0.0.1:3001 (Press CTRL+C to quit)
Walking Through the Endpoints
The examples below use port 3000 (Rust server). Replace with 3001 for the Python server – the responses are identical.
Set a convenience variable for the test file:
FILE=test-data/cars.sas7bdat
1. Health Check
curl http://localhost:3000/health
Expected output:
{"status":"ok"}
2. File Metadata
Upload a SAS file and get back its metadata as JSON:
curl -F "file=@$FILE" http://localhost:3000/metadata
Expected output (formatted):
{
"row_count": 1081,
"var_count": 13,
"table_name": "CARS",
"file_label": "Written by SAS",
"file_encoding": "WINDOWS-1252",
"version": 9,
"is64bit": 0,
"creation_time": "2008-09-30 12:55:01",
"modified_time": "2008-09-30 12:55:01",
"compression": "None",
"endianness": "Little",
"vars": {
"0": {
"var_name": "Brand",
"var_type": "String",
"var_type_class": "String",
"var_label": "",
"var_format": "",
"var_format_class": null,
"storage_width": 13,
"display_width": 0
},
"1": {
"var_name": "Model",
"var_type": "String",
"var_type_class": "String",
...
},
...
}
}
The `vars` map is keyed by column index and includes type info, labels, and SAS format metadata for all 13 variables.
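Because the keys are stringified column indices, client code should sort them numerically to recover column order. A small sketch over an excerpt of the metadata response above:

```python
import json

# Excerpt of the /metadata response (first two variables only).
metadata = json.loads("""
{
  "row_count": 1081,
  "var_count": 13,
  "vars": {
    "0": {"var_name": "Brand", "var_type": "String"},
    "1": {"var_name": "Model", "var_type": "String"}
  }
}
""")

# Keys are strings ("0", "1", ..., "12"), so sort numerically, not lexically,
# to avoid the "10" < "2" ordering problem on wider tables.
ordered = sorted(metadata["vars"].items(), key=lambda kv: int(kv[0]))
columns = [v["var_name"] for _, v in ordered]
print(columns)  # ['Brand', 'Model']
```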
3. Preview Rows
Get the first N rows as CSV (default 10, here we ask for 5):
curl -F "file=@$FILE" "http://localhost:3000/preview?rows=5"
Expected output:
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,47.0,48.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,46.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,45.0,51.0,0.0,0.0,1.0
4. Convert to CSV
Export the full dataset (all 1,081 rows) as CSV:
curl -F "file=@$FILE" "http://localhost:3000/data?format=csv" -o output.csv
The response has Content-Type: text/csv and Content-Disposition: attachment; filename="data.csv".
5. Convert to NDJSON
Export as newline-delimited JSON (one JSON object per row):
curl -F "file=@$FILE" "http://localhost:3000/data?format=ndjson"
Expected output (first few lines):
{"Brand":"TOYOTA","Model":"Prius","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.5,"Cylinders":4.0,"CityMPG":60.0,"HwyMPG":51.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
{"Brand":"HONDA","Model":"Civic Hybrid","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.3,"Cylinders":4.0,"CityMPG":48.0,"HwyMPG":47.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
{"Brand":"HONDA","Model":"Civic Hybrid","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.3,"Cylinders":4.0,"CityMPG":47.0,"HwyMPG":48.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
...
The response has Content-Type: application/x-ndjson.
6. Convert to Parquet
Export as Apache Parquet (binary, Snappy-compressed):
curl -F "file=@$FILE" "http://localhost:3000/data?format=parquet" -o output.parquet
This produces a ~15 KB Parquet file. You can inspect it with tools like parquet-tools, DuckDB, or pandas:
import pandas as pd
print(pd.read_parquet("output.parquet").head())
7. Convert to Feather
Export as Arrow IPC (Feather v2) format:
curl -F "file=@$FILE" "http://localhost:3000/data?format=feather" -o output.feather
This produces a ~130 KB Feather file. Read it back with any Arrow-compatible tool:
import pandas as pd
print(pd.read_feather("output.feather").head())
Automated Test Scripts
Both scripts work against either server – just change the URL.
Shell script (curl)
cd examples/api-demo
bash client/test_api.sh http://localhost:3000 test-data/cars.sas7bdat
bash client/test_api.sh http://localhost:3001 test-data/cars.sas7bdat
Python script (httpx)
Uses PEP 723 inline script metadata, so `uv run` handles dependencies automatically – no virtual environment setup needed:
cd examples/api-demo/client
uv run test_api.py http://localhost:3000 ../test-data/cars.sas7bdat
uv run test_api.py http://localhost:3001 ../test-data/cars.sas7bdat
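A PEP 723 header is just a specially formatted comment block at the top of the script. This minimal standalone example shows the format (the dependency list here is illustrative, not the demo's actual metadata):

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "httpx",
# ]
# ///

# `uv run` reads the comment block above, creates an ephemeral environment
# with the listed dependencies installed, and then runs the script body.
import sys

print(f"running under Python {sys.version_info.major}.{sys.version_info.minor}")
```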
Expected output:
=== Testing http://localhost:3000 with ../test-data/cars.sas7bdat ===
--- GET /health ---
{'status': 'ok'}
--- POST /metadata ---
row_count: 1081
var_count: 13
table_name: CARS
encoding: WINDOWS-1252
variables: 13
--- POST /preview (5 rows) ---
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
...
--- POST /data?format=csv ---
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
--- POST /data?format=ndjson ---
{"Brand":"TOYOTA","Model":"Prius","Minivan":0.0,...}
...
--- POST /data?format=parquet ---
15403 bytes
--- POST /data?format=feather ---
129650 bytes
=== All tests passed ===
API Reference
| Method | Path | Request | Response | Content-Type |
|---|---|---|---|---|
| GET | `/health` | – | `{"status": "ok"}` | `application/json` |
| POST | `/metadata` | multipart file | JSON metadata | `application/json` |
| POST | `/preview?rows=N` | multipart file | CSV text (first N rows, default 10) | `text/csv` |
| POST | `/data?format=csv` | multipart file | Full dataset as CSV | `text/csv` |
| POST | `/data?format=ndjson` | multipart file | Full dataset as NDJSON | `application/x-ndjson` |
| POST | `/data?format=parquet` | multipart file | Full dataset as Parquet | `application/octet-stream` |
| POST | `/data?format=feather` | multipart file | Full dataset as Feather | `application/octet-stream` |
The multipart field name must be file. Binary formats include a Content-Disposition header with a suggested filename.
How It Works
Rust Server
HTTP upload → Axum multipart extraction → Vec<u8>
  → spawn_blocking {
        ReadStatMetadata::read_metadata_from_bytes()
        ReadStatData::read_data_from_bytes() → Arrow RecordBatch
        write_batch_to_{csv,ndjson,parquet,feather}_bytes()
    }
  → HTTP response
All ReadStat C library FFI calls run inside spawn_blocking to avoid blocking the tokio async runtime.
Python Server
HTTP upload → FastAPI UploadFile → bytes
  → readstat_py.read_to_{csv,ndjson,parquet,feather}(bytes)
    → [PyO3 boundary]
      → ReadStatMetadata::read_metadata_from_bytes()
      → ReadStatData::read_data_from_bytes() → Arrow RecordBatch
      → write_batch_to_*_bytes()
    → [back to Python]
  → HTTP response
The PyO3 binding layer is intentionally thin – 5 functions that take `&[u8]` and return `Vec<u8>` (or `String` for metadata). No complex types cross the FFI boundary.
readstat-wasm Bun Demo
Demonstrates reading SAS .sas7bdat file metadata and data from JavaScript using the readstat-wasm package compiled to WebAssembly via Emscripten. The demo parses a .sas7bdat file entirely in-memory via WASM and converts it to CSV.
Quick start
If you already have Rust, Emscripten SDK, libclang, and Bun installed:
macOS / Linux:
# Activate Emscripten (first time per terminal session)
source /path/to/emsdk/emsdk_env.sh
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates/readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Run the demo
cd ../../examples/bun-demo
bun install
bun run index.ts
Windows (Git Bash):
# Activate Emscripten (first time per terminal session)
/c/path/to/emsdk/emsdk.bat activate latest
export EMSDK=C:/path/to/emsdk
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates/readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Run the demo
cd ../../examples/bun-demo
bun install
bun run index.ts
Windows (PowerShell):
# Activate Emscripten (first time per terminal session)
C:\path\to\emsdk\emsdk.bat activate latest
$env:EMSDK = "C:\path\to\emsdk"
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates\readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
copy target\wasm32-unknown-emscripten\release\readstat_wasm.wasm pkg\
# Run the demo
cd ..\..\examples\bun-demo
bun install
bun run index.ts
1. Install dependencies
Rust + wasm target
# Install Rust (if not already installed)
# macOS / Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Windows β download and run rustup-init.exe from https://rustup.rs
# Add the Emscripten wasm target (all platforms)
rustup target add wasm32-unknown-emscripten
Emscripten SDK
# Clone the SDK
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
# Install and activate the latest toolchain
./emsdk install latest
./emsdk activate latest
Activate in your shell (run every new terminal session, or add to your profile):
| Platform | Command |
|---|---|
| macOS / Linux | source ./emsdk_env.sh |
| Windows (cmd) | emsdk_env.bat |
| Windows (PowerShell) | emsdk_env.bat (then set $env:EMSDK = "C:\path\to\emsdk" if needed) |
| Windows (Git Bash) | source ./emsdk_env.sh (then export EMSDK=C:/path/to/emsdk if needed) |
Note: On Windows,
`emsdk_env.sh`/`emsdk_env.bat` may update PATH without exporting the `EMSDK` variable. If the build fails with "EMSDK must be set", set it manually as shown above. The build script will also attempt to auto-detect the emsdk root from PATH.
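That auto-detection amounts to scanning PATH entries for an emsdk install. The following is a simplified, hypothetical sketch of the idea (the `find_emsdk_root` helper and its heuristic are illustrative, not the build script's actual logic):

```python
import os


def find_emsdk_root(path_var):
    """Scan a PATH-style string for an entry under an emsdk checkout
    and return the emsdk root directory, or None if not found."""
    for entry in path_var.split(os.pathsep):
        parts = entry.replace("\\", "/").rstrip("/").split("/")
        if "emsdk" in parts:
            # Truncate the entry at the emsdk directory itself, so
            # ".../emsdk/upstream/emscripten" also resolves to the root.
            root_index = parts.index("emsdk")
            return "/".join(parts[: root_index + 1])
    return None


# Example: a typical PATH after `source emsdk_env.sh`.
sample = os.pathsep.join(
    ["/home/me/emsdk", "/home/me/emsdk/upstream/emscripten", "/usr/bin"]
)
print(find_emsdk_root(sample))  # /home/me/emsdk
```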
libclang (required by bindgen)
| Platform | Command |
|---|---|
| macOS | brew install llvm |
| Ubuntu / Debian | sudo apt-get install libclang-dev |
| Fedora | sudo dnf install clang-devel |
| Windows | Install LLVM from https://releases.llvm.org/download.html and set LIBCLANG_PATH to the lib directory (e.g., C:\Program Files\LLVM\lib) |
Bun
# macOS / Linux
curl -fsSL https://bun.sh/install | bash
# Windows (PowerShell)
powershell -c "irm bun.sh/install.ps1 | iex"
2. Initialize git submodules
From the repository root:
git submodule update --init --recursive
3. Build the WASM package
# Make sure Emscripten is activated in your shell (see table above)
# From the readstat-wasm crate directory
cd crates/readstat-wasm
# Build with Emscripten target (release mode)
cargo build --target wasm32-unknown-emscripten --release
# Copy the .wasm binary into the pkg/ directory
# macOS / Linux
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Windows (PowerShell)
# copy target\wasm32-unknown-emscripten\release\readstat_wasm.wasm pkg\
4. Run the demo
cd examples/bun-demo
bun install
bun run index.ts
Expected output
=== SAS7BDAT Metadata ===
Table name: CARS
File encoding: WINDOWS-1252
Row count: 1081
Variable count: 13
Compression: None
Endianness: Little
Created: 2008-09-30 12:55:01
Modified: 2008-09-30 12:55:01
=== Variables ===
[0] Brand (String, )
[1] Model (String, )
[2] Minivan (Double, )
[3] Wagon (Double, )
[4] Pickup (Double, )
[5] Automatic (Double, )
[6] EngineSize (Double, )
[7] Cylinders (Double, )
[8] CityMPG (Double, )
[9] HwyMPG (Double, )
[10] SUV (Double, )
[11] AWD (Double, )
[12] Hybrid (Double, )
=== CSV Data (preview) ===
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,47.0,48.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,46.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,45.0,51.0,0.0,0.0,1.0
... (1081 total data rows)
Wrote 1081 rows to cars.csv
How it works
The readstat-wasm crate compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly using the wasm32-unknown-emscripten target. Emscripten is required because the underlying ReadStat C code needs a C standard library (libc, iconv) – which Emscripten provides for wasm. (Note: zlib is only needed for SPSS zsav support, which is not included in the current wasm build.)
The crate exports eight C-compatible functions:
| Export | Signature | Purpose |
|---|---|---|
| `read_metadata` | `(ptr, len) -> *char` | Parse metadata as JSON from a byte buffer |
| `read_metadata_fast` | `(ptr, len) -> *char` | Same, but skips full row count |
| `read_data` | `(ptr, len) -> *char` | Parse data and return as CSV string |
| `read_data_ndjson` | `(ptr, len) -> *char` | Parse data and return as NDJSON string |
| `read_data_parquet` | `(ptr, len, out_len) -> *u8` | Parse data and return as Parquet bytes |
| `read_data_feather` | `(ptr, len, out_len) -> *u8` | Parse data and return as Feather bytes |
| `free_string` | `(ptr)` | Free a string returned by the string functions |
| `free_binary` | `(ptr, len)` | Free binary data returned by parquet/feather |
The data functions perform a two-pass parse over the same byte buffer: first to extract metadata (schema, row count), then to read row values into an Arrow RecordBatch, which is serialized to CSV or NDJSON in memory.
The JS wrapper in pkg/readstat_wasm.js handles:
- Loading the `.wasm` module
- Providing minimal WASI and Emscripten import stubs
- Memory management (malloc/free for input bytes, free_string for output)
- Converting between JS types and wasm pointers
Troubleshooting
EMSDK must be set for Emscripten builds
Set the EMSDK environment variable to point to your emsdk installation directory. On macOS/Linux: export EMSDK=/path/to/emsdk. On Windows (PowerShell): $env:EMSDK = "C:\path\to\emsdk". On Windows (Git Bash): export EMSDK=C:/path/to/emsdk. The build script also attempts to auto-detect the emsdk root from your PATH, so simply having Emscripten activated may be sufficient.
error: linking with emcc failed / undefined symbol: main
Make sure you're building from crates/readstat-wasm/ (not the repo root). The .cargo/config.toml in that directory provides the necessary linker flags.
The command line is too long (Windows)
This was a known issue when building all ReadStat C source files for the Emscripten target. It has been fixed – the build script now compiles only the SAS format sources for Emscripten builds, keeping the archiver command within Windows' command-line length limit.
Web Demo: SAS7BDAT Viewer & Converter
Browser-based demo that reads SAS .sas7bdat files entirely client-side using WebAssembly. Upload a file to view metadata, preview data in a sortable table, and export to CSV, NDJSON, Parquet, or Feather.
No build tools, no npm install, no framework – just static files served over HTTP.
Quick start
1. Copy the WASM binary into this directory (if not already present):

   cp crates/readstat-wasm/pkg/readstat_wasm.wasm examples/web-demo/

   If you need to rebuild it first, see the bun-demo README for build instructions.

2. Serve the directory with any static HTTP server. You must point the server at the directory, not at `index.html` directly:

   # From the repo root:
   python -m http.server 8000 -d examples/web-demo
   npx serve examples/web-demo
   bunx serve examples/web-demo

   # Or from the web-demo directory:
   cd examples/web-demo
   python -m http.server 8000
   npx serve
   bunx serve

   Note: Do not pass `index.html` as the argument (e.g., `bunx serve index.html`). That tells `serve` to look for a directory named `index.html`, which will cause the WASM and JS files to 404.

3. Open `http://localhost:3000` (for `serve`) or `http://localhost:8000` (for Python) in your browser.

4. Upload a `.sas7bdat` file (e.g., `crates/readstat-tests/tests/data/cars.sas7bdat`).
Features
- Metadata panel β table name, encoding, row/variable count, compression, timestamps
- Variable table β name, type, label, and format for each column
- Data preview β first 100 rows in a sortable table (uses Tabulator from CDN, with plain HTML table fallback)
- Export β download as CSV, NDJSON, Parquet, or Feather
WASM binary
The readstat_wasm.wasm file is built from the readstat-wasm crate (crates/readstat-wasm/). It compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly via the wasm32-unknown-emscripten target. The binary is ~9.7 MB.
A pre-built copy is checked in at crates/readstat-wasm/pkg/readstat_wasm.wasm.
Browser compatibility
- Requires a modern browser with WebAssembly support (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
- Must be served over HTTP(S) – `file://` URLs will not work due to WASM `fetch()` requirements
- Tabulator.js is loaded from CDN; if offline, the data preview falls back to a plain HTML table
File structure
examples/web-demo/
βββ index.html # App (HTML + inline CSS + inline JS)
βββ readstat_wasm.js # Browser-compatible WASM wrapper
βββ readstat_wasm.wasm # WASM binary (copied from pkg/)
βββ README.md # This file
SAS7BDAT SQL Explorer
An interactive browser-based tool for uploading .sas7bdat files and querying them with SQL – entirely client-side using WebAssembly.
How It Works
- Upload a `.sas7bdat` file (drag-and-drop or file picker)
- The file is parsed in-browser via the `readstat-wasm` WebAssembly module
- Data is loaded into AlaSQL, a client-side SQL engine
- Write SQL queries in a syntax-highlighted editor (powered by CodeMirror 6)
- View results in an interactive, sortable table (powered by Tabulator)
- Export query results as CSV
No data leaves your browser – all processing happens locally.
Quick Start
Serve the directory with any static HTTP server. The entire directory must be served (not just index.html) so the browser can load the .js and .wasm files alongside it.
From the repository root:
# Python
python -m http.server 8000 -d examples/sql-explorer
# Bun
bunx serve examples/sql-explorer
Or cd into the directory and serve from there:
cd examples/sql-explorer
# Python
python -m http.server 8000
# Bun
bunx serve .
Then open http://localhost:8000 in your browser.
Note: The page must be served over HTTP(S) – opening
`index.html` directly as a `file://` URL won't work because browsers block WASM loading from the local filesystem.
WASM Files
The readstat_wasm.js and readstat_wasm.wasm files are copies from examples/web-demo/. If you rebuild the WASM module, copy the updated files here as well.
To rebuild from source (requires Emscripten):
cd crates/readstat-wasm
./build.sh
cp pkg/readstat_wasm.js pkg/readstat_wasm.wasm ../../examples/sql-explorer/
CDN Dependencies
All loaded automatically from CDNs – no npm install required:
| Library | Version | CDN | Purpose |
|---|---|---|---|
| AlaSQL | 4.x | jsdelivr | Client-side SQL engine |
| CodeMirror 6 | 6.x | esm.sh | SQL editor with syntax highlighting |
| Tabulator | 6.x | unpkg | Interactive sortable/filterable result tables |
Example Queries
Once a file is loaded, the data is available as a table named data. Some queries to try:
-- Preview all rows
SELECT * FROM data LIMIT 100
-- Count rows
SELECT COUNT(*) AS total_rows FROM data
-- Filter rows
SELECT * FROM data WHERE column_name = 'value'
-- Aggregate
SELECT column_name, COUNT(*) AS n FROM data GROUP BY column_name ORDER BY n DESC
-- Select specific columns
SELECT col1, col2, col3 FROM data LIMIT 50
Column names with spaces or special characters should be wrapped in square brackets: [Column Name].
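A tiny helper can apply that quoting automatically when building queries. This is an illustrative sketch, not part of the demo; the `quote_ident` helper and the sample column names "City MPG" and "2024_sales" are hypothetical:

```python
import re


def quote_ident(name):
    """Wrap a column name in [brackets] unless it is already a plain
    identifier (letters/underscores/digits, not starting with a digit)."""
    if re.fullmatch(r"[A-Za-z_]\w*", name):
        return name
    return f"[{name}]"


cols = ["Brand", "City MPG", "2024_sales"]
select = "SELECT " + ", ".join(quote_ident(c) for c in cols) + " FROM data"
print(select)  # SELECT Brand, [City MPG], [2024_sales] FROM data
```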
For the full list of supported SQL syntax, see the AlaSQL SQL Reference.
API Documentation (Rustdocs)
Auto-generated API documentation for each crate is available below:
- readstat – Library crate for parsing SAS files into Arrow
- readstat_cli – CLI binary
- readstat_sys – Raw FFI bindings to the ReadStat C library
- readstat_iconv_sys – Windows-only iconv FFI bindings
- readstat_tests – Integration test suite
Note: These docs are generated by
`cargo doc` and deployed alongside this book by CI.