readstat-rs
Read, inspect, and convert SAS binary (.sas7bdat) files from Rust code, the command line, or the browser. Converts to CSV, Parquet, Feather, and NDJSON using Apache Arrow.
The original use case was a command-line tool for converting SAS files, but the project has since expanded into a workspace of crates that can be used as a Rust library, a CLI, or compiled to WebAssembly for browser and JavaScript runtimes.
Dependencies
The command-line tool is developed in Rust and is only possible due to the following excellent projects:
- The ReadStat C library developed by Evan Miller
- The arrow Rust crate developed by the Apache Arrow community
The ReadStat library is used to parse and read sas7bdat files, and the arrow crate is used to convert the read sas7bdat data into the Arrow memory format. Once in the Arrow memory format, the data can be written to other file formats.
💡 Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The `readstat-sys` crate exposes the full ReadStat API (all 125 functions across all formats). However, the higher-level crates (`readstat`, `readstat-cli`, `readstat-wasm`, `readstat-tests`) currently only implement support for SAS `.sas7bdat` files.
CLI Quickstart
Convert the first 50,000 rows of example.sas7bdat (by performing the read in parallel) to the file example.parquet, overwriting the file if it already exists.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 50000 --overwrite --parallel
CLI Install
Download a Release
[Mostly] static binaries for Linux, macOS, and Windows may be found at the Releases page.
Setup
Move the readstat binary to a known directory and add the binary to the user's PATH.
Linux & macOS
Ensure the path to readstat is added to the appropriate shell configuration file.
Windows
For Windows users, path configuration may be found within the Environment Variables menu. Executing the following from the command line opens the Environment Variables menu for the current user.
rundll32.exe sysdm.cpl,EditEnvironmentVariables
Alternatively, update the user-level PATH in PowerShell (replace C:\path\to\readstat with the actual directory):
$currentPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$currentPath;C:\path\to\readstat", "User")
After running the above, restart your terminal for the change to take effect.
Run
Run the binary.
readstat --help
CLI Usage
The binary is invoked using subcommands:
- `metadata` – writes file and variable metadata to standard out or JSON
- `preview` – writes the first N rows of parsed data as `csv` to standard out
- `data` – writes parsed data in `csv`, `feather`, `ndjson`, or `parquet` format to a file
Column metadata (labels, SAS format strings, and storage widths) is preserved in Parquet and Feather output as Arrow field metadata. See docs/TECHNICAL.md for details.
For the full CLI reference (including column selection, parallelism, memory considerations, SQL queries, reader modes, and debug options) see docs/USAGE.md.
For library, API server, and WebAssembly usage, see Examples below.
Build from Source
Clone the repository (with submodules), install platform-specific developer tools, and run cargo build. Platform-specific instructions for Linux, macOS, and Windows are in docs/BUILDING.md.
Platform Support
| Platform | Status | C library | Notes |
|---|---|---|---|
| Linux (glibc) | ✅ Builds and runs | System iconv, system zlib | |
| Linux (musl) | ✅ Builds and runs | System iconv, system zlib | |
| macOS | ✅ Builds and runs | System libiconv, system zlib | |
| Windows (MSVC) | ✅ Builds and runs | Vendored iconv, vendored zlib | Requires libclang for bindgen. MSVC supported since ReadStat 1.1.5 (no msys2 needed). |
Documentation
| Document | Description |
|---|---|
| docs/ARCHITECTURE.md | Crate layout, key types, and architectural patterns |
| docs/USAGE.md | Full CLI reference and examples |
| docs/BUILDING.md | Clone, build, and linking details per platform |
| docs/TECHNICAL.md | Floating-point precision and date/time handling |
| docs/TESTING.md | Running tests, dataset table, valgrind |
| docs/BENCHMARKING.md | Criterion benchmarks, hyperfine, and profiling |
| docs/CI-CD.md | GitHub Actions triggers and artifacts |
| docs/MEMORY_SAFETY.md | Automated memory-safety CI checks (Valgrind, ASan, Miri, unsafe audit) |
| docs/RELEASING.md | Step-by-step guide for publishing crates to crates.io |
Workspace Crates
| Crate | Path | Description |
|---|---|---|
| `readstat` | `crates/readstat/` | Pure library for parsing SAS files into Arrow RecordBatch format. Output writers are feature-gated. |
| `readstat-cli` | `crates/readstat-cli/` | Binary crate producing the readstat CLI tool (arg parsing, progress bars, orchestration). |
| `readstat-sys` | `crates/readstat-sys/` | Raw FFI bindings to the full ReadStat C library (SAS, SPSS, Stata) via bindgen. |
| `readstat-iconv-sys` | `crates/readstat-iconv-sys/` | Windows-only FFI bindings to libiconv for character encoding conversion. |
| `readstat-tests` | `crates/readstat-tests/` | Integration test suite (29 modules, 14 datasets). |
| `readstat-wasm` | `crates/readstat-wasm/` | WebAssembly build for browser/JS usage (excluded from workspace, built with Emscripten). |
For full architectural details, see docs/ARCHITECTURE.md.
Examples
The examples/ directory contains runnable demos showing different ways to use readstat-rs.
| Example | Description |
|---|---|
| `cli-demo` | Convert a .sas7bdat file to CSV, NDJSON, Parquet, and Feather using the readstat CLI |
| `api-demo` | API servers in Rust (Axum) and Python (FastAPI + PyO3): upload, inspect, and convert SAS files over HTTP |
| `bun-demo` | Parse a .sas7bdat file from JavaScript using the WebAssembly build with Bun |
| `web-demo` | Browser-based viewer and converter: upload, preview, and export entirely client-side via WASM |
| `sql-explorer` | Browser-based SQL explorer: upload a .sas7bdat file and query it interactively with SQL via AlaSQL |
To use readstat as a library in your own Rust project, add the readstat crate as a dependency.
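As a minimal sketch (assuming the crate version documented in this repository and the default writer features listed for the `readstat` crate), the dependency entry might look like:

```toml
# Hypothetical Cargo.toml fragment. The csv/feather/ndjson/parquet writer
# features are enabled by default; disable default features to compile only
# the writers you need.
[dependencies]
readstat = { version = "0.20", default-features = false, features = ["parquet"] }
```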
Resources
The following have been incredibly helpful while developing!
- How to not RiiR
- Making a *-sys crate
- Rust Closures in FFI
- Rust FFI: Microsoft Flight Simulator SDK
- Stack Overflow answers by Jake Goulding
- ReadStat pull request to add MSVC/Windows support
- jamovi-readstat appveyor.yml file to build ReadStat on Windows
- Arrow documentation for utilizing ArrayBuilders
Building from Source
Clone
Ensure submodules are also cloned.
git clone --recurse-submodules https://github.com/curtisalexander/readstat-rs.git
The ReadStat repository is included as a git submodule within this repository. To build and link, the readstat-sys crate is built first; the readstat library and readstat-cli binary crates then depend on readstat-sys.
Linux
Install developer tools
sudo apt install build-essential clang
Build
cargo build
iconv: Linked dynamically against the system-provided library. On most distributions it is available by default. No explicit link directives are emitted in the build script; the system linker resolves it automatically.
zlib: Linked via the libz-sys crate, which will use the system-provided zlib if available or compile from source as a fallback.
macOS
Install developer tools
xcode-select --install
Build
cargo build
iconv: Linked dynamically against the system-provided library that ships with macOS (via cargo:rustc-link-lib=iconv in the readstat-sys build script). No additional packages need to be installed.
zlib: Linked via the libz-sys crate, which will use the system-provided zlib that ships with macOS.
Windows
Building on Windows requires that LLVM and the Visual Studio C++ Build Tools be downloaded and installed.
In addition, the path to libclang needs to be set in the environment variable LIBCLANG_PATH. If LIBCLANG_PATH is not set, the readstat-sys build script will check the default path C:\Program Files\LLVM\lib and fail with instructions if it does not exist.
For details see the following.
Build
cargo build
iconv: Compiled from source using the vendored libiconv-win-build submodule (located at crates/readstat-iconv-sys/vendor/libiconv-win-build/) via the readstat-iconv-sys crate. readstat-iconv-sys is a Windows-only dependency (gated behind [target.'cfg(windows)'.dependencies] in readstat-sys/Cargo.toml).
zlib: Compiled from source via the libz-sys crate (statically linked).
Linking Summary
| Platform | iconv | zlib |
|---|---|---|
| Linux (glibc/musl) | Dynamic (system) | libz-sys (prefers system, falls back to source) |
| macOS (x86/ARM) | Dynamic (system) | libz-sys (uses system) |
| Windows (MSVC) | Static (vendored submodule) | libz-sys (compiled from source, static) |
Usage
After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:
- `metadata` – writes the following to standard out or JSON:
- row count
- variable count
- table name
- table label
- file encoding
- format version
- bitness
- creation time
- modified time
- compression
- byte order
- variable names
- variable type classes
- variable types
- variable labels
- variable format classes
- variable formats
- arrow data types
- `preview` – writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in `csv` format to standard out
- `data` – writes parsed data in `csv`, `feather`, `ndjson`, or `parquet` format to a file
Metadata
To write metadata to standard out, invoke the following.
readstat metadata /some/dir/to/example.sas7bdat
To write metadata as JSON, invoke the following. This is useful for reading the metadata programmatically.
readstat metadata /some/dir/to/example.sas7bdat --as-json
The JSON output contains file-level metadata and a vars object keyed by variable index. This makes it straightforward to search for a particular column by piping the output to jq or Python.
Search for a column with jq
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | select(.value.var_name == "Make") | .value'
Search for a column with Python
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
match = [v for v in md['vars'].values() if v['var_name'] == 'Make']
if match:
print(json.dumps(match[0], indent=2))
"
Preview Data
To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).
readstat preview /some/dir/to/example.sas7bdat
To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.
readstat preview /some/dir/to/example.sas7bdat --rows 100
Data
The data subcommand includes a parameter for --format, which is the file format that is to be written. Currently, the following formats have been implemented:
- `csv`
- `feather`
- `ndjson`
- `parquet`
csv
To write parsed data (as csv) to a file, invoke the following (default is to write all parsed data to the specified file).
The default --format is csv; thus, the parameter is omitted from the examples below.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv
To write the first 100 rows of parsed data (as csv) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100
feather
To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather
To write the first 100 rows of parsed data (as feather) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100
ndjson
To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson
To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100
parquet
To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet
To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100
To write parsed data (as parquet) to a file with specific compression settings, invoke the following:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3
Column Selection
Select specific columns to include when converting or previewing data.
Step 1: View available columns
readstat metadata /some/dir/to/example.sas7bdat
Or as JSON for programmatic use with jq:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | .value.var_name'
Or with Python:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
for v in md['vars'].values():
print(v['var_name'])
"
Step 2: Select columns on the command line
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns Brand,Model,EngineSize
Step 2 (alt): Select columns from a file
Create columns.txt:
# Columns to extract from the dataset
Brand
Model
EngineSize
Then pass it to the CLI:
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns-file columns.txt
Preview with column selection
readstat preview /some/dir/to/example.sas7bdat --columns Brand,Model,EngineSize
Parallelism
The data subcommand includes parameters for both parallel reading and parallel writing:
Parallel Reading (--parallel)
If set, the sas7bdat is read in parallel: when the total rows to process exceed stream-rows (default 10,000 if unset), each chunk of rows is read in parallel. Note that all processors on the user's machine are used with the --parallel option. Allowing the user to throttle this number may be considered in the future.
⚠️ Utilizing the --parallel parameter will increase memory usage: all chunks are read in parallel and collected in memory before being sent to the writer. In addition, because all processors are utilized, CPU usage may be maxed out during reading. Row ordering from the original sas7bdat is preserved.
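As a rough, dependency-free sketch of this pattern (using `std::thread` for illustration; the actual implementation uses Rayon), chunks can be processed concurrently, tagged with their index, and reassembled in the original row order:

```rust
use std::thread;

// Sketch of order-preserving parallel chunk processing: each chunk is
// handled on its own thread, results carry their chunk index, and the
// final output is reassembled by sorting on that index.
fn process_chunks_in_parallel(chunks: Vec<Vec<u64>>) -> Vec<u64> {
    let handles: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(idx, chunk)| {
            thread::spawn(move || {
                // Simulate per-chunk parsing work (here: doubling values).
                let parsed: Vec<u64> = chunk.iter().map(|v| v * 2).collect();
                (idx, parsed)
            })
        })
        .collect();

    // Collect results and restore the original chunk order by index.
    let mut results: Vec<(usize, Vec<u64>)> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().flat_map(|(_, parsed)| parsed).collect()
}
```

Because every chunk result is held until the merge, this style trades memory for throughput, which is exactly the caveat noted above.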
Parallel Writing (--parallel-write)
When combined with --parallel, the --parallel-write flag enables parallel writing for Parquet format files. This can significantly improve write performance for large datasets by:
- Writing record batches to temporary files in parallel using all available processors
- Merging the temporary files into the final output
- Using spooled temporary files that keep data in memory until a threshold is reached
Note: Parallel writing currently only supports the Parquet format. Other formats (CSV, Feather, NDJSON) will use optimized sequential writes with BufWriter.
Example usage:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write
Memory Buffer Size (--parallel-write-buffer-mb)
Controls the memory buffer size (in MB) before spilling to disk during parallel writes. Defaults to 100 MB. Valid range: 1-10240 MB.
Smaller buffers will cause data to spill to disk sooner, while larger buffers keep more data in memory. Choose based on your available memory and dataset size:
- Small datasets (< 100 MB): Use default or larger buffer to keep everything in memory
- Large datasets (> 1 GB): Consider smaller buffer (10-50 MB) to manage memory usage
- Memory-constrained systems: Use smaller buffer (1-10 MB)
Example with custom buffer size:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write --parallel-write-buffer-mb 200
⚠️ Parallel writing may write batches out of order. This is acceptable for Parquet files as the row order is preserved when merged.
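A hypothetical sketch of the spooling idea mentioned above (names and threshold handling are illustrative, not the crate's actual implementation): bytes accumulate in memory until a threshold is crossed, then spill to a temporary file on disk.

```rust
use std::fs::File;
use std::io::{self, Write};

// Illustrative spooled buffer: data stays in memory until `threshold`
// bytes would be exceeded, at which point it spills to a file.
enum Spool {
    Memory(Vec<u8>),
    Disk(File),
}

struct SpooledWriter {
    spool: Spool,
    threshold: usize,
    path: std::path::PathBuf,
}

impl SpooledWriter {
    fn new(threshold: usize, path: std::path::PathBuf) -> Self {
        SpooledWriter { spool: Spool::Memory(Vec::new()), threshold, path }
    }

    fn is_spilled(&self) -> bool {
        matches!(self.spool, Spool::Disk(_))
    }

    fn write_all(&mut self, data: &[u8]) -> io::Result<()> {
        if let Spool::Memory(buf) = &mut self.spool {
            if buf.len() + data.len() <= self.threshold {
                buf.extend_from_slice(data);
                return Ok(());
            }
            // Threshold exceeded: flush the in-memory buffer to disk.
            let mut file = File::create(&self.path)?;
            file.write_all(buf)?;
            self.spool = Spool::Disk(file);
        }
        if let Spool::Disk(file) = &mut self.spool {
            file.write_all(data)?;
        }
        Ok(())
    }
}
```

With a 100 MB default threshold, small batch groups never touch the disk at all, while large ones degrade gracefully to temp-file I/O.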
Memory Considerations
Default: Sequential Writes
In the default sequential write mode, a bounded channel (capacity 10) connects the reader thread to the writer. This means at most 10 chunks (each containing up to stream-rows rows) are held in memory at any time, providing natural backpressure when the writer is slower than the reader. For most workloads this keeps memory usage reasonable, but for very wide datasets (hundreds of columns, string-heavy) each chunk can be large; consider lowering --stream-rows if memory is a concern.
Sequential Write (default)
==========================
Reader Thread Bounded Channel (cap 10) Main Thread
+---------------------+ +------------------------+ +---------------------+
| | | | | |
| +-----------+ | send | +--+--+--+--+--+--+ | recv | +-------+ |
| | chunk 1 |-------|------>| | | | | | | | |------>| | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+ | | +-------+ |
| +-----------+ | send | channel is full! | | |
| | chunk 2 |-------|------>| +--+--+--+--+--+--+--+| | +-------+ |
| +-----------+ | | | | | | | | | || | | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+--+| | +-------+ |
| | chunk 3 |-------|-XXXXX | | | |
| +-----------+ | BLOCK | writer drains a slot | | +-------+ |
| ... waits ... | | +--+--+--+--+--+--+ | | | write |---> file |
| | chunk 3 |-------|------>| | | | | | | | | | +-------+ |
| +-----------+ | ok! | +--+--+--+--+--+--+ | | |
| | | | | |
+---------------------+ +------------------------+ +---------------------+
Memory at any moment: <= 10 chunks in the channel + 1 being written
Backpressure: reader blocks when channel is full
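The same backpressure can be sketched with the standard library's bounded channel (`std::sync::mpsc::sync_channel`; the crate itself uses Crossbeam channels): the reader blocks on `send` whenever the channel already holds its capacity of chunks.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Minimal sketch of the bounded reader-writer pipeline diagrammed above.
fn run_pipeline(num_chunks: usize, capacity: usize) -> Vec<usize> {
    let (tx, rx) = sync_channel::<usize>(capacity);

    let reader = thread::spawn(move || {
        for chunk_id in 0..num_chunks {
            // `send` blocks when the channel already holds `capacity`
            // chunks, providing backpressure against a slow writer.
            tx.send(chunk_id).unwrap();
        }
        // Dropping `tx` here closes the channel and ends the writer loop.
    });

    // The "writer": drain chunks in arrival order.
    let written: Vec<usize> = rx.iter().collect();
    reader.join().unwrap();
    written
}
```

With a capacity of 10, at most 10 chunks ever sit in the channel, regardless of how many chunks the file produces.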
Parallel Writes (--parallel-write)
--parallel-write: Uses bounded-batch processing: batches are pulled from the channel in groups (up to 10 at a time), written in parallel to temporary Parquet files, then the next group is pulled. This preserves the channel's backpressure so that memory usage stays bounded rather than loading the entire dataset at once. All temporary files are merged into the final output at the end.
Parallel Write (--parallel --parallel-write)
============================================
Reader Thread Bounded Channel (cap 10) Main Thread
+------------------+ +------------------------+ +-------------------------+
| | | | | |
| +----------+ | send | | recv | Pull <= 10 batches |
| | chunk 1 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | | b1 | b2 | .. | bN | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 2 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| +----------+ | | Write in parallel |
| | chunk 3 |-----|----> ... | to temp .parquet files |
| +----------+ | | | | | |
| ... | | v v v |
| | | tmp_0 tmp_1 ... tmp_N |
| | +------------------------+ | |
| +----------+ | send | | recv | Pull next <= 10 |
| | chunk 11 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | |b11 |b12 | .. | bM | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 12 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| ... | | tmp_N+1 ... tmp_M |
+------------------+ | |
| ... repeat until done |
+-------------------------+
|
+----------------------------------------+
|
v
+-------------------+ +--------------------+
| Merge all temp | | |
| .parquet files |------>| final output.pqt |
| in order | | |
+-------------------+ +--------------------+
Memory at any moment: <= 10 chunks in channel + 10 being written
Backpressure: preserved -- reader blocks while a batch group is being written
SQL Queries (--sql)
⚠️ --sql (feature-gated): SQL queries require the full dataset to be materialized in memory via DataFusion's MemTable before query execution. For large files this may result in significant memory usage. Queries that filter rows (e.g. SELECT ... WHERE ...) will reduce the output size but the input must still be fully loaded.
SQL Query Mode (--sql "SELECT ...")
===================================
Reader Thread Bounded Channel Main Thread
+------------------+ +---------------+ +---------------------------+
| | | | | |
| +----------+ | send | | recv | Collect ALL batches |
| | chunk 1 |-----|------>| |------>| into memory (required |
| +----------+ | | | | by DataFusion MemTable) |
| +----------+ | send | | | |
| | chunk 2 |-----|------>| |------>| +-----+-----+-----+ |
| +----------+ | | | | | b1 | b2 | ... | |
| ... | | | | +-----+-----+-----+ |
| +----------+ | send | | | | |
| | chunk N |-----|------>| |------>| v |
| +----------+ | | | | +-------------+ |
+------------------+ +---------------+ | | DataFusion | |
| | SQL Engine | |
| +-------------+ |
| | |
| v |
| Write filtered results |
| to output file |
+---------------------------+
Memory at peak: ALL chunks in memory (no backpressure)
This is inherent to SQL execution over in-memory tables.
Reading Metadata from Output Files
When converting to Parquet or Feather, readstat-rs preserves column metadata (labels, SAS format strings, and storage widths) as Arrow field metadata. Schema-level metadata includes the table label when present.
The following metadata keys may appear on each field:
| Key | Description | Condition |
|---|---|---|
| `label` | User-assigned variable label | Non-empty |
| `sas_format` | SAS format string (e.g. DATE9, BEST12, $30) | Non-empty |
| `storage_width` | Number of bytes used to store the variable | Always |
| `display_width` | Display width hint from the file | Non-zero |
Schema-level metadata:
| Key | Description | Condition |
|---|---|---|
| `table_label` | User-assigned file label | Non-empty |
Reading metadata with Python (pyarrow)
import pyarrow.parquet as pq
schema = pq.read_schema("example.parquet")
# Table-level metadata
print(schema.metadata.get(b"table_label", b"").decode())
# Per-column metadata
for field in schema:
meta = field.metadata or {}
print(f"{field.name}:")
print(f" label: {meta.get(b'label', b'').decode()}")
print(f" sas_format: {meta.get(b'sas_format', b'').decode()}")
print(f" storage_width: {meta.get(b'storage_width', b'').decode()}")
print(f" display_width: {meta.get(b'display_width', b'').decode()}")
Reading metadata with R (arrow)
library(arrow)
schema <- read_parquet("example.parquet", as_data_frame = FALSE)$schema
# Per-column metadata
for (i in seq_len(schema$num_fields)) {
  field <- schema$field(i - 1)  # field() is 0-indexed
  cat(field$name, "\n")
  cat("  label:        ", field$metadata$label, "\n")
  cat("  sas_format:   ", field$metadata$sas_format, "\n")
  cat("  storage_width:", field$metadata$storage_width, "\n")
  cat("  display_width:", field$metadata$display_width, "\n")
}
Reader
The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.
- `mem` – Parse and read the entire `sas7bdat` into memory before writing to either standard out or a file
- `stream` (default) – Parse and read at most `stream-rows` into memory before writing to disk
  - `stream-rows` may be set via the command line parameter `--stream-rows`; if elided it defaults to 10,000 rows
Why is this useful?
- `mem` is useful for testing purposes
- `stream` is useful for keeping memory usage low for large datasets (and hence is the default)
- In general, users should not need to deviate from the default, `stream`, unless they have a specific need
- In addition, by enabling these options as command line parameters, hyperfine may be used to benchmark across an assortment of file sizes
Debug
Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.
⚠️ This is quite verbose! If using the preview or data subcommand, debug information is written for every single value!
# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...
Help
For full details run with --help.
readstat --help
readstat metadata --help
readstat preview --help
readstat data --help
Architecture
Rust CLI tool and library that reads SAS binary files (.sas7bdat) and converts them to other formats (CSV, Feather, NDJSON, Parquet). Uses FFI bindings to the ReadStat C library for parsing, and Apache Arrow for in-memory representation and output.
Scope: The readstat-sys crate exposes the full ReadStat C API, which supports SAS (.sas7bdat, .xpt), SPSS (.sav, .zsav, .por), and Stata (.dta). However, the readstat, readstat-cli, and readstat-wasm crates currently only implement parsing and conversion for SAS .sas7bdat files.
Workspace Layout
readstat-rs/
├── Cargo.toml              # Workspace root (edition 2024, resolver 2)
├── crates/
│   ├── readstat/           # Library crate (parse SAS → Arrow, optional format writers)
│   ├── readstat-cli/       # Binary crate (CLI arg parsing, orchestration)
│   ├── readstat-sys/       # FFI bindings to ReadStat C library (bindgen)
│   ├── readstat-iconv-sys/ # FFI bindings to iconv (Windows only)
│   ├── readstat-tests/     # Integration test suite
│   └── readstat-wasm/      # WebAssembly build (excluded from workspace)
├── examples/
│   ├── cli-demo/           # CLI conversion demo
│   ├── api-demo/           # REST API servers (Rust + Python)
│   ├── bun-demo/           # WASM usage from Bun/JS
│   ├── web-demo/           # Browser-based viewer and converter
│   └── sql-explorer/       # Browser-based SQL explorer (AlaSQL + WASM)
└── docs/
Crate Details
readstat (v0.20.0) – Library Crate
Path: crates/readstat/
Pure library for parsing SAS binary files into Arrow RecordBatch format. Output format writers (CSV, Feather, NDJSON, Parquet) are feature-gated.
Features: csv, feather, ndjson, parquet (all enabled by default), sql.
Key source modules in crates/readstat/src/:
| Module | Purpose |
|---|---|
| `lib.rs` | Public API exports |
| `cb.rs` | C callback functions for ReadStat (handle_metadata, handle_variable, handle_value) |
| `rs_data.rs` | Data reading, Arrow RecordBatch conversion |
| `rs_metadata.rs` | Metadata extraction, Arrow schema building |
| `rs_parser.rs` | ReadStatParser wrapper around C parser |
| `rs_path.rs` | Input path validation |
| `rs_write_config.rs` | Output configuration (path, format, compression) |
| `rs_var.rs` | Variable types and value handling |
| `rs_write.rs` | Output writers (CSV, Feather, NDJSON, Parquet) |
| `progress.rs` | ProgressCallback trait for parsing progress reporting |
| `rs_query.rs` | SQL query execution via DataFusion (feature-gated) |
| `formats.rs` | SAS format detection (118 date/time/datetime formats, regex-based) |
| `err.rs` | Error enum (41 variants mapping to C library errors) |
| `common.rs` | Utility functions |
| `rs_buffer_io.rs` | Buffer I/O operations |
Key public types:
- `ReadStatData` – coordinates FFI parsing, accumulates values directly into typed Arrow builders, produces Arrow RecordBatch
- `ReadStatMetadata` – file-level metadata (row/var counts, encoding, compression, schema)
- `ColumnBuilder` – enum wrapping 12 typed Arrow builders (StringBuilder, Float64Builder, Date32Builder, etc.); values are appended during FFI callbacks with zero intermediate allocation
- `ReadStatWriter` – writes output in requested format
- `ReadStatPath` – validated input file path
- `WriteConfig` – output configuration (path, format, compression)
- `OutFormat` – output format enum (Csv, Feather, Ndjson, Parquet)
- `ProgressCallback` – trait for receiving progress updates during parsing
Major dependencies: Arrow v57 ecosystem, Parquet (5 compression codecs, optional), Rayon, chrono, memmap2.
readstat-cli (v0.20.0) – CLI Binary
Path: crates/readstat-cli/
Binary crate producing the readstat CLI tool. Uses clap with three subcommands:
- `metadata` – print file metadata (row/var counts, labels, encoding, etc.)
- `preview` – preview first N rows
- `data` – convert to output format (csv, feather, ndjson, parquet)
Owns CLI arg parsing, progress bars, colored output, and reader-writer thread orchestration.
Additional dependencies: clap v4, colored, indicatif, crossbeam, env_logger, path_abs.
readstat-sys (v0.3.0) – FFI Bindings
Path: crates/readstat-sys/
build.rs compiles ~49 C source files from vendor/ReadStat/ git submodule via the cc crate, then generates Rust bindings with bindgen. Exposes the full ReadStat API including support for SAS, SPSS, and Stata formats. Platform-specific linking for iconv and zlib:
| Platform | iconv | zlib | Notes |
|---|---|---|---|
| Windows (windows-msvc) | Static: compiled from vendored readstat-iconv-sys submodule | Static: compiled via libz-sys crate | readstat-iconv-sys is a cfg(windows) dependency; needs LIBCLANG_PATH |
| macOS (apple-darwin) | Dynamic: system libiconv | libz-sys (uses system zlib) | iconv linked via cargo:rustc-link-lib=iconv |
| Linux (gnu/musl) | Dynamic: system library | libz-sys (prefers system, falls back to source) | No explicit iconv link directives; system linker resolves automatically |
Header include paths are propagated between crates using Cargoβs links key:
- `readstat-iconv-sys` sets `cargo:include=...` which becomes `DEP_ICONV_INCLUDE` in `readstat-sys`
- `libz-sys` sets `cargo:include=...` which becomes `DEP_Z_INCLUDE` in `readstat-sys`
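A hedged sketch of this handoff (illustrative helper functions, not the actual build scripts): the exporting crate prints a `cargo:include=` metadata line, and Cargo re-exposes it to dependent build scripts as an environment variable named after the `links` key.

```rust
use std::path::PathBuf;

// The build script of the crate with `links = "iconv"` prints a line of
// this shape to stdout; Cargo captures it as build-script metadata.
fn export_include(path: &str) -> String {
    format!("cargo:include={path}")
}

// Cargo turns that metadata into DEP_ICONV_INCLUDE for dependents, so a
// dependent build script (e.g. readstat-sys's build.rs) can read it back:
fn dep_iconv_include() -> Option<PathBuf> {
    std::env::var_os("DEP_ICONV_INCLUDE").map(PathBuf::from)
}
```

This is how readstat-sys locates the vendored iconv headers on Windows without hard-coding a path.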
readstat-iconv-sys (v0.3.0) – iconv FFI (Windows)
Path: crates/readstat-iconv-sys/
Windows-only (#[cfg(windows)]). Compiles libiconv from the vendor/libiconv-win-build/ git submodule using the cc crate, producing a static library. On non-Windows platforms the build script is a no-op. The links = "iconv" key in Cargo.toml allows readstat-sys to discover the include path via the DEP_ICONV_INCLUDE environment variable.
readstat-wasm (v0.1.0) – WebAssembly Build
Path: crates/readstat-wasm/
WebAssembly build of the readstat library for parsing SAS .sas7bdat files in JavaScript. Compiles the ReadStat C library and the Rust readstat library to WebAssembly via the wasm32-unknown-emscripten target. Excluded from the Cargo workspace (built separately with Emscripten).
Exports: read_metadata, read_metadata_fast, read_data (CSV), read_data_ndjson, read_data_parquet, read_data_feather, free_string, free_binary. Not published to crates.io (publish = false).
readstat-tests – Integration Tests
Path: crates/readstat-tests/
29 test modules covering: all SAS data types, 118 date/time/datetime formats, missing values, malformed UTF-8, large pages, CLI subcommands, parallel read/write, Parquet output, CSV output, Arrow migration, row offsets, scientific notation, column selection, skip row count, memory-mapped file reading, byte-slice reading, and SQL queries. Every sas7bdat file in the test data directory has both metadata and data reading tests.
Test data lives in tests/data/*.sas7bdat (14 datasets). SAS scripts to regenerate test data are in util/.
| Dataset | Metadata Test | Data Test |
|---|---|---|
| `all_dates.sas7bdat` | ✅ | ✅ |
| `all_datetimes.sas7bdat` | ✅ | ✅ |
| `all_times.sas7bdat` | ✅ | ✅ |
| `all_types.sas7bdat` | ✅ | ✅ |
| `cars.sas7bdat` | ✅ | ✅ |
| `hasmissing.sas7bdat` | ✅ | ✅ |
| `intel.sas7bdat` | ✅ | ✅ |
| `malformed_utf8.sas7bdat` | ✅ | ✅ |
| `messydata.sas7bdat` | ✅ | ✅ |
| `rand_ds_largepage_err.sas7bdat` | ✅ | ✅ |
| `rand_ds_largepage_ok.sas7bdat` | ✅ | ✅ |
| `scientific_notation.sas7bdat` | ✅ | ✅ |
| `somedata.sas7bdat` | ✅ | ✅ |
| `somemiss.sas7bdat` | ✅ | ✅ |
Build Prerequisites
- Rust (edition 2024)
- libclang (for bindgen)
- Git submodules must be initialized (`git submodule update --init --recursive`)
- On Windows: MSVC toolchain
Key Architectural Patterns
- FFI callback pattern: the ReadStat C library calls Rust callbacks (`cb.rs`) during parsing; data accumulates in `ReadStatData` via raw pointer casts
- Streaming: the default reader streams rows in chunks (10k) to manage memory
- Parallel processing: Rayon for parallel reading, Crossbeam channels for reader-writer coordination
- Column filtering: optional `--columns`/`--columns-file` flags restrict parsing to selected variables; unselected values are skipped in the `handle_value` callback while row-boundary detection uses the original (unfiltered) variable count
- Arrow pipeline: SAS data → typed Arrow builders (direct append in FFI callbacks) → Arrow RecordBatch → output format
- Multiple I/O strategies: file path (default), memory-mapped files (`memmap2`), and in-memory byte slices — all feed into the same FFI parsing pipeline
- Metadata preservation: SAS variable labels, format strings, and storage widths are persisted as Arrow field metadata, surviving round-trips through Parquet and Feather. See TECHNICAL.md for details.
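The column-filtering rule above can be illustrated with a small sketch. This is not the crate's actual code — the real logic lives in the `handle_value` FFI callback in `cb.rs`, and `on_value` here is a hypothetical stand-in — but it shows why row boundaries must be detected against the original, unfiltered variable count even when values are skipped:

```rust
// Sketch (hypothetical helper): decide whether to keep a value and whether
// this value completes the current row.
fn on_value(var_index: usize, total_vars: usize, selected: &[bool]) -> (bool, bool) {
    let keep = selected[var_index]; // write this value into the builders?
    let row_complete = var_index + 1 == total_vars; // boundary uses the UNFILTERED count
    (keep, row_complete)
}

fn main() {
    // 3 variables in the file, but only columns 0 and 2 selected
    let selected = [true, false, true];
    assert_eq!(on_value(1, 3, &selected), (false, false)); // skipped, row still open
    assert_eq!(on_value(2, 3, &selected), (true, true));   // kept, and row complete
}
```

If the boundary check used the filtered count instead, a skipped trailing column would cause rows to close early and values to shift between rows.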
Technical Details
Floating Point Values
⚠️ Decimal values are rounded to contain only 14 decimal digits!
For example, the number 1.1234567890123456 created within SAS would be returned as 1.12345678901235 within Rust.
Why does this happen? Is this an implementation error? No: rounding to only 14 decimal digits is intentional in the Rust code.
As a specific example, when testing with the cars.sas7bdat dataset (which was created originally on Windows), the numeric value 4.6 as observed within SAS was being returned as 4.600000000000001 (15 digits) within Rust. Values created on Windows with an x64 processor are only accurate to 15 digits.
For comparison, the ReadStat binary truncates to 14 decimal places when writing to csv.
Finally, SAS represents all numeric values in floating-point representation, which creates a challenge for all parsed numeric values.
Implementation: pure-arithmetic rounding
Rounding is performed using pure f64 arithmetic in cb.rs, avoiding any string formatting or heap allocation:
const ROUND_SCALE: f64 = 1e14;

fn round_decimal_f64(v: f64) -> f64 {
    if !v.is_finite() {
        return v;
    }
    let int_part = v.trunc();
    let frac_part = v.fract();
    let rounded_frac = (frac_part * ROUND_SCALE).round() / ROUND_SCALE;
    int_part + rounded_frac
}
The value is split into integer and fractional parts before scaling. This is necessary because large SAS datetime values (~1.9e9) multiplied directly by 1e14 would exceed f64's exact integer range (2^53), causing precision loss. Since fract() is always in (-1, 1), |fract()| * 1e14 < 1e14 < 2^53, keeping the scaled value within the exact-integer range.
Why this is equivalent to the previous string roundtrip (format!("{:.14}") + lexical::parse): both approaches produce the nearest representable f64 to the value rounded to 14 decimal places. The tie-breaking rule (half-away-from-zero for .round() vs half-to-even for format!) is never exercised because every f64 is a dyadic rational (m / 2^k), and a true decimal midpoint would require an odd factor of 5 in the denominator — which is impossible for any f64 value.
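The equivalence claim above can be checked directly. The sketch below reproduces the arithmetic rounding from this section and compares it against the former string round-trip (using std parsing in place of lexical::parse), on the exact example quoted earlier in this section:

```rust
const ROUND_SCALE: f64 = 1e14;

// Reproduced from the section above
fn round_decimal_f64(v: f64) -> f64 {
    if !v.is_finite() {
        return v;
    }
    v.trunc() + (v.fract() * ROUND_SCALE).round() / ROUND_SCALE
}

fn main() {
    let v = 1.1234567890123456_f64;
    let arithmetic = round_decimal_f64(v);
    // The former approach: format to 14 decimal places, then parse back
    let string_roundtrip: f64 = format!("{v:.14}").parse().unwrap();
    assert_eq!(arithmetic, string_roundtrip);
    // Matches the example above: 1.1234567890123456 -> 1.12345678901235
    assert_eq!(format!("{arithmetic}"), "1.12345678901235");
}
```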
Sources
- How SAS Stores Numeric Values
- Accuracy on x64 Windows Processors
- SAS on Windows with x64 processors can only represent 15 digits
- Floating-point arithmetic may give inaccurate results in Excel
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg, 1991)
Date, Time, and Datetimes
All 118 SAS date, time, and datetime formats are recognized and parsed appropriately. For the full list of supported formats, see sas_date_time_formats.md.
⚠️ If the format does not match a recognized SAS date, time, or datetime format, or if the value does not have a format applied, then the value will be parsed and read as a numeric value!
Details
SAS stores dates, times, and datetimes internally as numeric values. To distinguish among dates, times, datetimes, or numeric values, a SAS format is read from the variable metadata. If the format matches a recognized SAS date, time, or datetime format then the numeric value is converted and read into memory using one of the Arrow types:
- Date32Type
- Time32SecondType
- Time64MicrosecondType — for time formats with microsecond precision (e.g. TIME15.6, decimal places 4–6)
- TimestampSecondType
- TimestampMillisecondType — for datetime formats with millisecond precision (e.g. DATETIME22.3, decimal places 1–3)
- TimestampMicrosecondType — for datetime formats with microsecond precision (e.g. DATETIME22.6, decimal places 4–6)
- TimestampNanosecondType — for datetime formats with nanosecond precision (e.g. DATETIME22.9, decimal places 7–9)
If values are read into memory as Arrow date, time, or datetime types, then when they are written — from an Arrow RecordBatch to csv, feather, ndjson, or parquet — they are treated as dates, times, or datetimes and not as numeric values.
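The numeric-to-temporal conversion implied above can be sketched as follows. SAS stores dates as days and datetimes as seconds since 1960-01-01, while Arrow's Date32 and Timestamp types count from the Unix epoch (1970-01-01) — a gap of 3,653 days, or 315,619,200 seconds. The helper names below are hypothetical; the crate's actual conversion lives in its FFI callbacks:

```rust
// Days (and seconds) from the SAS epoch (1960-01-01) to the Unix epoch (1970-01-01)
const SAS_TO_UNIX_DAYS: i32 = 3_653;
const SAS_TO_UNIX_SECONDS: i64 = 3_653 * 86_400; // = 315_619_200

fn sas_date_to_date32(sas_days: f64) -> i32 {
    sas_days as i32 - SAS_TO_UNIX_DAYS
}

fn sas_datetime_to_timestamp_secs(sas_seconds: f64) -> i64 {
    sas_seconds as i64 - SAS_TO_UNIX_SECONDS
}

// Choosing a timestamp unit from the datetime format's decimal width,
// per the mapping listed above
fn timestamp_unit(decimals: u32) -> &'static str {
    match decimals {
        0 => "second",
        1..=3 => "millisecond",
        4..=6 => "microsecond",
        _ => "nanosecond",
    }
}

fn main() {
    assert_eq!(sas_date_to_date32(3_653.0), 0); // SAS day 3653 is 1970-01-01
    assert_eq!(sas_datetime_to_timestamp_secs(0.0), -315_619_200); // 1960-01-01T00:00:00
    assert_eq!(timestamp_unit(3), "millisecond"); // e.g. DATETIME22.3
}
```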
Column Metadata in Arrow and Parquet
When converting to Parquet or Feather, readstat-rs persists column-level and table-level metadata into the Arrow schema. This metadata survives round-trips through Parquet and Feather files, allowing downstream consumers to recover SAS-specific information.
Metadata keys
Field (column) metadata
| Key | Type | Description | Source formats |
|---|---|---|---|
| label | string | User-assigned variable label | SAS, SPSS, Stata |
| sas_format | string | SAS format string (e.g. DATE9, BEST12, $30) | SAS |
| storage_width | integer (as string) | Number of bytes used to store the variable value | All |
| display_width | integer (as string) | Display width hint from the file | XPORT, SPSS |
Schema (table) metadata
| Key | Type | Description |
|---|---|---|
| table_label | string | User-assigned file label |
Storage width semantics
- SAS numeric variables: always 8 bytes (IEEE 754 double-precision)
- SAS string variables: equal to the declared character length (e.g. `$30` → 30 bytes)
- The `storage_width` field is always present in metadata
Display width semantics
- sas7bdat files: typically 0 (not stored in the format)
- XPORT files: populated from the format width
- SPSS files: populated from the variableβs print/write format
- The `display_width` field is only present in metadata when non-zero
SAS format strings and Arrow types
The SAS format string (e.g. DATE9, DATETIME22.3, TIME8) determines how a numeric variable is mapped to an Arrow type. The original format string is preserved in the sas_format metadata key, allowing downstream tools to reconstruct the original SAS formatting even after conversion.
For the full list of recognized SAS date, time, and datetime formats, see sas_date_time_formats.md.
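The metadata rules above can be sketched as a plain map, using the key names this document defines (`label`, `sas_format`, `storage_width`, `display_width`). The helper itself is hypothetical — the crate attaches these keys to Arrow fields — but the sketch captures the always-present vs. only-when-non-zero distinction:

```rust
use std::collections::HashMap;

// Sketch (hypothetical helper): build the field-metadata map described above.
fn sas_field_metadata(
    label: &str,
    sas_format: &str,
    storage_width: u32,
    display_width: u32,
) -> HashMap<String, String> {
    let mut m = HashMap::new();
    m.insert("label".to_string(), label.to_string());
    m.insert("sas_format".to_string(), sas_format.to_string());
    // storage_width is always present
    m.insert("storage_width".to_string(), storage_width.to_string());
    // display_width is only persisted when non-zero (see above)
    if display_width != 0 {
        m.insert("display_width".to_string(), display_width.to_string());
    }
    m
}

fn main() {
    // A numeric SAS variable: 8 bytes, no display width in a sas7bdat file
    let m = sas_field_metadata("Model year", "DATE9", 8, 0);
    assert_eq!(m["sas_format"], "DATE9");
    assert_eq!(m["storage_width"], "8");
    assert!(!m.contains_key("display_width"));
}
```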
Reading metadata from output files
See the Reading Metadata from Output Files section in the Usage guide for Python and R examples.
Testing
To perform unit / integration tests, run the following.
cargo test --workspace
To run only integration tests:
cargo test -p readstat-tests
Datasets
Formally tested (via integration tests) against the following datasets. See the README.md for data sources.
- `ahs2019n.sas7bdat` — US Census data (download via download_ahs.sh or download_ahs.ps1)
- `all_dates.sas7bdat` — SAS dataset containing all possible date formats
- `all_datetimes.sas7bdat` — SAS dataset containing all possible datetime formats
- `all_times.sas7bdat` — SAS dataset containing all possible time formats
- `all_types.sas7bdat` — SAS dataset containing all SAS types
- `cars.sas7bdat` — SAS cars dataset
- `hasmissing.sas7bdat` — SAS dataset containing missing values
- `intel.sas7bdat`
- `malformed_utf8.sas7bdat` — SAS dataset with truncated multi-byte UTF-8 characters (issue #78)
- `messydata.sas7bdat`
- `rand_ds_largepage_err.sas7bdat` — Created using create_rand_ds.sas with BUFSIZE set to 2M
- `rand_ds_largepage_ok.sas7bdat` — Created using create_rand_ds.sas with BUFSIZE set to 1M
- `scientific_notation.sas7bdat` — Used to test float parsing
- `somedata.sas7bdat` — Used to test Parquet label preservation
- `somemiss.sas7bdat`
Valgrind
To ensure no memory leaks, valgrind may be utilized. For example, to ensure no memory leaks for the test parse_file_metadata_test, run the following from within the readstat directory.
valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
Memory Safety
This project contains unsafe Rust code (FFI callbacks, pointer casts, memory-mapped I/O) and links against the vendored ReadStat C library. Four automated CI checks guard against memory errors.
CI Jobs
All four jobs run on every workflow dispatch and tag push, in parallel with the build jobs. Any memory error fails the job with a nonzero exit code.
Miri (Rust undefined behavior)
- Platform: Ubuntu (Linux)
- Scope: Unit tests in the `readstat` crate only (`cargo miri test -p readstat`)
- What it catches: Undefined behavior in pure-Rust unsafe code — invalid pointer arithmetic, uninitialized reads, provenance violations, use-after-free in Rust allocations
- Limitation: Cannot execute FFI calls into C code, so integration tests (`readstat-tests`) are excluded
Configuration:
- Uses Rust nightly with the `miri` component
- `MIRIFLAGS="-Zmiri-disable-isolation"` allows tests that use `tempfile` to create directories
AddressSanitizer β Linux
- Platform: Ubuntu (Linux)
- Scope: Full workspace — lib tests, integration tests, binary tests (`cargo test --workspace --lib --tests --bins`)
- What it catches: Heap/stack buffer overflows, use-after-free, double-free, memory leaks (LeakSanitizer is enabled by default on Linux), across both Rust and C code
Configuration:
- `RUSTFLAGS="-Zsanitizer=address -Clinker=clang"` — instruments Rust code and links the ASan runtime via clang
- `READSTAT_SANITIZE_ADDRESS=1` — triggers `readstat-sys/build.rs` to compile the ReadStat C library with `-fsanitize=address -fno-omit-frame-pointer`
- Doctests are excluded (`--lib --tests --bins`) because `rustdoc` does not properly inherit sanitizer linker flags
AddressSanitizer β macOS
- Platform: macOS (arm64)
- Scope: Full workspace — lib tests, integration tests, binary tests
- What it catches: Buffer overflows, use-after-free, double-free in Rust code and at the FFI boundary
Configuration:
- `RUSTFLAGS="-Zsanitizer=address"` — instruments Rust code only
- The ReadStat C library is not instrumented on macOS because Apple Clang and Rust's LLVM have incompatible ASan runtimes — see ASan Runtime Mismatch below
- LeakSanitizer is not supported on macOS
- Doctests excluded for the same reason as Linux
AddressSanitizer β Windows
- Platform: Windows (x86_64, MSVC toolchain)
- Scope: Full workspace — lib tests, integration tests, binary tests
- What it catches: Buffer overflows, use-after-free, double-free in Rust code and at the FFI boundary
Configuration:
- `RUSTFLAGS="-Zsanitizer=address"` — instruments Rust code only
- Rust on Windows MSVC uses Microsoft's ASan runtime (from Visual Studio), not LLVM's compiler-rt. The compiler passes `/INFERASANLIBS` to the MSVC linker, which auto-discovers the runtime import library at link time. See PR #118521.
- Important: the MSVC ASan runtime DLL (`clang_rt.asan_dynamic-x86_64.dll`) is NOT on PATH by default. The linker finds the import library at build time via `/INFERASANLIBS`, but the DLL loader needs the DLL on PATH at test runtime. The CI job uses `vswhere.exe` to locate the DLL directory (e.g., `C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\<ver>\bin\Hostx64\x64\`) and prepends it to PATH.
- LLVM is installed only for `libclang` (required by bindgen), pinned to the same version as the regular Windows build job. It is not used for the ASan runtime.
- The ReadStat C library is not currently instrumented on Windows. Unlike macOS, there is no runtime mismatch — both Rust and `cl.exe` use the same MSVC ASan runtime. Full C instrumentation is a future improvement (see Future Work).
- LeakSanitizer is not supported on Windows
- Doctests excluded for the same reason as Linux
How READSTAT_SANITIZE_ADDRESS Works
The readstat-sys/build.rs build script checks for the READSTAT_SANITIZE_ADDRESS environment variable. When set, it adds sanitizer flags to the C compiler flags for the ReadStat library only. This is intentionally scoped — a global CFLAGS would instrument third-party sys crates (e.g., zstd-sys), causing linker failures.
The flags are platform-specific:
- Linux/macOS: `-fsanitize=address -fno-omit-frame-pointer` (GCC/Clang syntax)
- Windows MSVC: `/fsanitize=address` (MSVC syntax)
Currently only the Linux CI job sets READSTAT_SANITIZE_ADDRESS=1 because it is the only platform where C instrumentation has been validated.
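The gate described above can be sketched as a pure function over the two inputs (the real logic lives in `readstat-sys/build.rs` and feeds a C build; `asan_cflags` here is hypothetical):

```rust
// Sketch (hypothetical helper): compute the extra C compiler flags from the
// READSTAT_SANITIZE_ADDRESS env var and the target toolchain.
fn asan_cflags(env_set: bool, target_is_msvc: bool) -> Vec<&'static str> {
    if !env_set {
        return Vec::new(); // unset: the C library builds uninstrumented
    }
    if target_is_msvc {
        vec!["/fsanitize=address"] // MSVC syntax
    } else {
        vec!["-fsanitize=address", "-fno-omit-frame-pointer"] // GCC/Clang syntax
    }
}

fn main() {
    assert!(asan_cflags(false, false).is_empty());
    assert_eq!(asan_cflags(true, true), vec!["/fsanitize=address"]);
    assert_eq!(
        asan_cflags(true, false),
        vec!["-fsanitize=address", "-fno-omit-frame-pointer"]
    );
}
```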
ASan Runtime Mismatch (macOS)
macOS has an ASan runtime mismatch that prevents instrumenting the C code alongside Rust. Apple Clang is a fork of LLVM with its own ASan runtime versioning. When both Rust and the C library are instrumented, the linker sees two incompatible ASan runtimes and fails with ___asan_version_mismatch_check_apple_clang_* vs ___asan_version_mismatch_check_v8. A potential workaround is to install upstream LLVM via Homebrew (brew install llvm) and set CC=/opt/homebrew/opt/llvm/bin/clang so both the C code and Rust use the same LLVM ASan runtime. However, this is fragile — the Homebrew LLVM version must stay close to the LLVM version used by Rust nightly, which changes frequently.
Windows does NOT have this problem. Rust on x86_64-pc-windows-msvc uses Microsoftβs ASan runtime (PR #118521), and so does cl.exe /fsanitize=address. Both link the same clang_rt.asan_dynamic-x86_64.dll from Visual Studio. Full C + Rust ASan instrumentation is theoretically possible on Windows β see Future Work.
Bottom line: Linux has full C + Rust ASan coverage. macOS provides Rust-only coverage due to the Apple Clang runtime mismatch. Windows provides Rust-only coverage currently, but full coverage is a future improvement since there is no runtime mismatch.
Future Work: Windows C Instrumentation
Since Rust and MSVC share the same ASan runtime on Windows, enabling READSTAT_SANITIZE_ADDRESS=1 in the Windows CI job should allow full C + Rust instrumentation, matching Linux's coverage. This requires:
- Setting `READSTAT_SANITIZE_ADDRESS=1` so `readstat-sys/build.rs` adds `/fsanitize=address` when compiling the ReadStat C library
- Verifying there are no linker conflicts (if conflicts arise, the unstable `-Zexternal-clangrt` flag can tell Rust to skip linking its own runtime copy)
- Ensuring the MSVC ASan runtime DLL is on PATH at test time (the CI job already does this via `vswhere.exe`)
Running Locally
Miri
rustup +nightly component add miri
MIRIFLAGS="-Zmiri-disable-isolation" cargo +nightly miri test -p readstat
ASan on Linux
RUSTFLAGS="-Zsanitizer=address -Clinker=clang" \
READSTAT_SANITIZE_ADDRESS=1 \
cargo +nightly test --workspace --lib --tests --bins --target x86_64-unknown-linux-gnu
ASan on macOS
RUSTFLAGS="-Zsanitizer=address" \
cargo +nightly test --workspace --lib --tests --bins --target aarch64-apple-darwin
ASan on Windows
$env:RUSTFLAGS = "-Zsanitizer=address"
# The MSVC ASAN runtime DLL must be on PATH. Find it via vswhere:
$vsPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -latest -property installationPath
$msvcVer = (Get-ChildItem "$vsPath\VC\Tools\MSVC" | Sort-Object Name -Descending | Select-Object -First 1).Name
$env:PATH = "$vsPath\VC\Tools\MSVC\$msvcVer\bin\Hostx64\x64;$env:PATH"
cargo +nightly test --workspace --lib --tests --bins --target x86_64-pc-windows-msvc
Valgrind (Linux)
For manual checks with full C library coverage, valgrind can also be used against debug test binaries:
cargo test -p readstat-tests --no-run
valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
Coverage Summary
| Tool | Platform | Rust code | C code (ReadStat) | Leak detection |
|---|---|---|---|---|
| Miri | Linux | Unit tests only | No (FFI excluded) | No |
| ASan | Linux | Full workspace | Yes (instrumented) | Yes |
| ASan | macOS | Full workspace | No (runtime mismatch) | No |
| ASan | Windows | Full workspace | Not yet (no mismatch — see future work) | No |
| Valgrind | Linux (manual) | Full | Full | Yes |
Performance Benchmarking with Criterion
Overview
This document provides a comprehensive guide to performance benchmarking in readstat-rs using Criterion.rs.
Quick Start
# Run all benchmarks
cd crates/readstat
cargo bench
# View HTML reports
open target/criterion/report/index.html
What Gets Benchmarked
1. Reading Performance
- Metadata Reading (~300-950 µs) - File header parsing
- Single Chunk Reading - Full dataset read performance
- Chunked Reading - Streaming with different chunk sizes (1K, 5K, 10K rows)
2. Data Conversion
- Arrow Conversion - SAS types → Arrow RecordBatch overhead
3. Writing Performance
- CSV Writing - Text format output
- Parquet Compression - Uncompressed, Snappy, Zstd comparison
- Format Comparison - CSV vs Parquet vs Feather vs NDJSON
4. Parallel Write Optimization
- Buffer Sizes - SpooledTempFile memory thresholds (1MB, 10MB, 100MB, 500MB)
5. End-to-End Pipeline
- Complete Conversion - Read + Write combined (most important)
Sample Results
From initial benchmark run (example output):
metadata_reading/all_types.sas7bdat
time: [299.41 µs 301.84 µs 304.29 µs]
metadata_reading/cars.sas7bdat
time: [935.21 µs 943.52 µs 952.41 µs]
read_single_chunk/cars.sas7bdat
time: [~2-3 ms]
thrpt: [~150-200K rows/sec]
write_parquet_compression/snappy
time: [~4-6 ms]
thrpt: [~70-100K rows/sec]
end_to_end_conversion/parquet
time: [~6-9 ms]
thrpt: [~50-70K rows/sec]
Interpreting Results
Understanding the Output
Time Measurement:
time: [299.41 µs 301.84 µs 304.29 µs]
^ ^ ^
| | +-- Upper bound (95% confidence)
| +------------ Median
+---------------------- Lower bound (95% confidence)
Throughput:
thrpt: [150K elem/s 175K elem/s 200K elem/s]
^ ^ ^
| | +-- Upper bound
| +-------------- Median
+-------------------------- Lower bound
Change Detection:
change: [-2.3456% -1.2345% +0.1234%] (p = 0.12 > 0.05)
^ ^ ^ ^
| | | +-- Statistical significance
| | +----------- Upper bound of change
| +--------------------- Median change
+------------------------------- Lower bound of change
What to Look For
Red Flags (Investigate)
- High variance (>10%) - Results unreliable
- Significant regression (>5% slower, p < 0.05)
- Outliers (>5% of samples)
Opportunities
- Chunked reading - Test if different chunk size improves throughput
- Buffer sizes - If small buffer performs as well as large, save memory
- Compression - If uncompressed only slightly faster, use compression
Validation
- Low variance (<5%) - Reliable results
- Improvements (>10% faster, p < 0.05)
- Expected patterns (e.g., compression should be slower but smaller)
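The decision thresholds above can be written down as simple predicates. These are taken from this document's rules of thumb, not from Criterion's API:

```rust
// Regression: significantly slower (median change > +5%, p < 0.05)
fn is_regression(median_change_pct: f64, p: f64) -> bool {
    median_change_pct > 5.0 && p < 0.05
}

// Improvement: significantly faster (median change < -10%, p < 0.05)
fn is_improvement(median_change_pct: f64, p: f64) -> bool {
    median_change_pct < -10.0 && p < 0.05
}

fn main() {
    // The sample change output above: -1.2345% median with p = 0.12,
    // which is neither a regression nor an improvement
    assert!(!is_regression(-1.2345, 0.12));
    assert!(!is_improvement(-1.2345, 0.12));
    assert!(is_regression(7.5, 0.01));
}
```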
Performance Optimization Workflow
Step 1: Establish Baseline
# Save current performance as baseline
cargo bench --save-baseline main
# Results saved to target/criterion/{benchmark}/main/
Step 2: Make Changes
Edit code with optimization hypothesis:
- Increase buffer size
- Change algorithm
- Add caching
- Parallel processing
Step 3: Measure Impact
# Compare against baseline
cargo bench --baseline main
# Look for "change: [X% Y% Z%]" in output
Step 4: Analyze & Iterate
- If improved (>10%, p < 0.05): keep the change, then update the baseline (cargo bench --save-baseline main)
- If no change (<5%): the optimization didn't help — profile to find the real bottleneck
- If regressed (slower): revert the change and investigate why performance decreased
Common Optimization Scenarios
Scenario 1: Slow Reading
Symptoms: read_single_chunk time is high
Investigate:
- ReadStat C library overhead (FFI calls)
- Memory allocation patterns
- Callback overhead
Try:
- Larger buffers in C library
- Memory-mapped files (see evaluation doc)
- Pre-allocate column vectors
Scenario 2: Slow Writing
Symptoms: write_formats time is high
Investigate:
- BufWriter buffer size
- Format-specific overhead
- Compression CPU usage
Try:
- Increase BufWriter capacity (currently 8KB)
- Use faster compression (Snappy vs Zstd)
- Parallel writing (already implemented)
Scenario 3: Memory Issues
Symptoms: System swapping, OOM errors
Investigate:
- Chunk size too large
- Too many parallel streams
- Memory leaks
Try:
- Reduce `stream_rows` (default 10,000)
- Reduce the parallel write buffer (default 100MB)
- Use bounded channels (already implemented)
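A back-of-envelope sketch helps when tuning the knobs above. SAS numerics are 8-byte doubles (see the storage-width section), so a chunk's numeric columns alone occupy roughly rows × numeric_vars × 8 bytes; string columns add their declared widths on top. The helper is illustrative only:

```rust
// Approximate lower bound on bytes held by one chunk of numeric columns
fn chunk_bytes(rows: usize, numeric_vars: usize) -> usize {
    rows * numeric_vars * 8 // 8 bytes per IEEE 754 double
}

fn main() {
    // Default 10,000-row chunks with 50 numeric variables: about 4 MB
    assert_eq!(chunk_bytes(10_000, 50), 4_000_000);
    // Halving stream_rows halves the estimate
    assert_eq!(chunk_bytes(5_000, 50), 2_000_000);
}
```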
Scenario 4: High Variance
Symptoms: Large confidence intervals, many outliers
Investigate:
- System background activity
- CPU frequency scaling
- Thermal throttling
Try:
- Close background apps
- Disable frequency scaling
- Run on consistent power mode
Advanced Profiling
CPU Profiling with Flamegraphs
# Install flamegraph
cargo install flamegraph
# Profile a specific benchmark
cargo flamegraph --bench readstat_benchmarks -- --bench read_single_chunk
# Open flamegraph.svg to see hotspots
What to look for:
- Wide bars = lots of time spent
- Deep stacks = call overhead
- Unexpected functions = bugs/inefficiency
Memory Profiling
# Using valgrind (Linux)
valgrind --tool=massif \
cargo bench read_single_chunk --no-run
ms_print massif.out.* > memory_profile.txt
# Using heaptrack (Linux)
heaptrack cargo bench read_single_chunk
heaptrack_gui heaptrack.*.gz
System Call Tracing
# Linux: strace
strace -c cargo bench read_single_chunk 2>&1 | tail -20
# macOS: dtruss
sudo dtruss -c cargo bench read_single_chunk
Comparing Implementations
Before/After Memory-Mapped Files
# Baseline without mmap
git checkout main
cargo bench --save-baseline without-mmap
# With mmap implementation
git checkout feature/mmap
cargo bench --baseline without-mmap
# Look for improvements in read_single_chunk
Parallel vs Sequential
# Test with different parallelism settings
cargo bench end_to_end -- --parallel
cargo bench end_to_end -- --sequential
CI/CD Integration
Performance Regression Detection
Add to .github/workflows/benchmarks.yml:
name: Performance Benchmarks
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Rust
uses: dtolnay/rust-toolchain@stable
- name: Run benchmarks
run: |
cd crates/readstat
cargo bench --no-run # Just compile for CI
- name: Compare with baseline (on main branch)
if: github.event_name == 'pull_request'
run: |
git fetch origin main:main
git checkout main
cargo bench --save-baseline main
git checkout -
cargo bench --baseline main
Best Practices
Do's
- Run benchmarks on consistent hardware
- Close background applications
- Use `--save-baseline` for comparisons
- Profile after benchmarking to find bottlenecks
- Document performance changes in PRs
- Test on representative data sizes
Don'ts
- Don't benchmark on a laptop (throttling)
- Don't optimize without profiling first
- Don't trust results with high variance
- Don't compare across different systems
- Don't commit benchmark artifacts
- Don't skip statistical significance checks
Performance Goals
Current Performance (Baseline)
- Metadata reading: ~300-950 µs
- Read throughput: ~150-200K rows/sec
- Write throughput: ~70-100K rows/sec
- End-to-end: ~50-70K rows/sec
Target Performance (Goals)
- Metadata reading: <500 µs (↓30%)
- Read throughput: >250K rows/sec (↑25%)
- Write throughput: >100K rows/sec (↑30%)
- End-to-end: >100K rows/sec (↑40%)
Stretch Goals
- Memory-mapped reads: 2x faster for large files
- Parallel writes: 3-4x speedup with 4+ cores
- Compression: <10% overhead for Snappy
Data Files for Benchmarking
Current Test Data
- all_types.sas7bdat - 3 rows, 10 vars (tiny)
- cars.sas7bdat - 1081 rows, 13 vars (small)
Recommended Additional Data
For comprehensive benchmarking, consider adding:
Small (good for quick iteration):
- < 1 MB file size
- < 1,000 rows
- 5-10 variables
Medium (typical use case):
- 10-100 MB file size
- 10,000-100,000 rows
- 10-50 variables
Large (stress test):
- Over 1 GB file size
- Over 1,000,000 rows
- 50+ variables
Resources
Documentation
Tools
- cargo-flamegraph
- cargo-benchcmp
- hyperfine - CLI benchmarking (see below)
Blog Posts
Next Steps
- Run full benchmark suite: `cargo bench`
- Review HTML reports: Open `target/criterion/report/index.html`
- Identify bottlenecks: Look for slowest operations
- Profile with flamegraph: Focus on hotspots
- Implement optimizations: Test one at a time
- Validate improvements: Compare against baseline
- Document findings: Update this file with results
Questions?
- See detailed README: `crates/readstat/benches/README.md`
- Check Criterion docs: https://bheisler.github.io/criterion.rs/book/
- Review performance evaluation: Memory-mapped files analysis (separate doc)
Benchmarking with hyperfine
Benchmarking performed with hyperfine.
This example compares the performance of the Rust binary with the performance of the C binary built from the ReadStat repository. In general, the hope is that performance stays fairly close to that of the C binary.
To run, execute the following from within the readstat directory.
# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"
Note: First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the --prepare option provided by hyperfine on Windows.
# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"
Additional benchmarking may be performed in the future now that channels and threads have been implemented.
Profiling with Flamegraphs
Profiling performed with cargo flamegraph.
To run, execute the following from within the readstat directory.
cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv
Flamegraph is written to readstat/flamegraph.svg.
Note: Flamegraphs have not yet been utilized to improve performance.
GitHub Actions
The CI/CD workflow can be triggered in multiple ways:
1. Tag Push (Release)
Push a tag to trigger a full release build with GitHub Release artifacts:
# add and commit local changes
git add .
git commit -m "commit msg"
# push local changes to remote
git push
# add local tag
git tag -a v0.1.0 -m "v0.1.0"
# push local tag to remote
git push origin --tags
To delete and recreate tags:
# delete local tag
git tag --delete v0.1.0
# delete remote tag
git push origin --delete v0.1.0
2. Manual Trigger (GitHub UI)
Trigger a build manually from the GitHub Actions web interface (build-only, no releases):
- Go to the Actions tab
- Select the readstat-rs workflow
- Click Run workflow
- Optionally specify:
  - Version string: Label for artifacts (default: `dev`)
Note: Manual triggers only build artifacts and do not create GitHub releases. To create a release, use a tag push.
3. API Trigger (External Tools)
Trigger builds programmatically using the GitHub API. This is useful for automation tools like Claude Code.
Using gh CLI
# Trigger a build
gh api repos/curtisalexander/readstat-rs/dispatches \
-f event_type=build
# Trigger a build with custom version label
gh api repos/curtisalexander/readstat-rs/dispatches \
-f event_type=build \
-F client_payload='{"version":"test-build-123"}'
Using curl
curl -X POST \
-H "Authorization: token $GITHUB_TOKEN" \
-H "Accept: application/vnd.github.v3+json" \
https://api.github.com/repos/curtisalexander/readstat-rs/dispatches \
-d '{"event_type": "build", "client_payload": {"version": "dev"}}'
4. Claude Code Integration
To have Claude Code trigger a CI build, use this prompt:
Trigger a CI build for readstat-rs by running:
gh api repos/curtisalexander/readstat-rs/dispatches -f event_type=build
Event Types
Repository dispatch event types for API triggers:
| Event Type | Description |
|---|---|
| build | Build all targets and upload artifacts |
| test | Same as build (alias for clarity) |
| release | Same as build (reserved for future use) |
Note: API triggers only build artifacts and do not create GitHub releases. To create a release, use a tag push.
Artifacts
All builds (regardless of trigger method) upload artifacts that can be downloaded from the workflow run page. Artifacts are retained for the default GitHub Actions retention period.
Releasing to crates.io
Step-by-step guide for publishing readstat-rs crates to crates.io.
Quick Reference
# Run all pre-publish checks
./scripts/release-check.sh # Linux/macOS
.\scripts\release-check.ps1 # Windows
# Switch vendor dirs from submodules to copied files
./scripts/vendor.sh prepare # Linux/macOS
.\scripts\vendor.ps1 prepare # Windows
# Publish (in dependency order)
cargo publish -p readstat-iconv-sys
cargo publish -p readstat-sys
cargo publish -p readstat
cargo publish -p readstat-cli
# Restore submodules after publishing
./scripts/vendor.sh restore # Linux/macOS
.\scripts\vendor.ps1 restore # Windows
Pre-Release Checklist
1. Version Bumps
Update version numbers in these files (keep them in sync):
| File | Fields |
|---|---|
| crates/readstat/Cargo.toml | version, readstat-sys dependency version |
| crates/readstat-cli/Cargo.toml | version, readstat dependency version |
| crates/readstat-sys/Cargo.toml | version, readstat-iconv-sys dependency version |
| crates/readstat-iconv-sys/Cargo.toml | version |
Version conventions:
- `readstat` and `readstat-cli` share the same version (e.g. 0.20.0)
- `readstat-sys` and `readstat-iconv-sys` share the same version (e.g. 0.3.0)
- Bump sys crate versions only when the vendored C library or bindings change
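The pairing conventions above can be expressed as a check, similar in spirit to what the release-check scripts verify (the function here is hypothetical, not part of the scripts):

```rust
// Sketch: the two version-pairing rules above.
fn versions_consistent(readstat: &str, cli: &str, sys: &str, iconv_sys: &str) -> bool {
    readstat == cli && sys == iconv_sys
}

fn main() {
    assert!(versions_consistent("0.20.0", "0.20.0", "0.3.0", "0.3.0"));
    // readstat and readstat-cli drifting apart should fail the check
    assert!(!versions_consistent("0.20.0", "0.19.0", "0.3.0", "0.3.0"));
}
```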
2. Update CHANGELOG.md
Add an entry for the new version:
## [0.20.0] - 2026-XX-XX
### Added
- ...
### Changed
- ...
### Fixed
- ...
3. Run Automated Checks
./scripts/release-check.sh
This runs:
- `cargo fmt --all -- --check` — formatting
- `cargo clippy --workspace` — linting
- `readstat-wasm` fmt and clippy (excluded from workspace, checked separately)
- `cargo test --workspace` — all tests
- `cargo doc --workspace --no-deps` — documentation build
- `cargo deny check` — license and security audit (if installed)
- Version consistency checks
- CHANGELOG entry check
- `cargo package` dry-run for each publishable crate
Fix any failures before proceeding.
4. Manual Checks
- README.md is up to date
- Documentation reflects any API changes
- Architecture docs (`docs/ARCHITECTURE.md`) are current
- mdbook builds cleanly: `./scripts/build-book.sh`
- `readstat-wasm` builds and exports are up to date (excluded from workspace; not published to crates.io)
Vendor Preparation
The readstat-sys and readstat-iconv-sys crates vendor C source code from git
submodules. cargo publish cannot include git submodule contents, so the files
must be copied as regular files before publishing.
Switch to publish mode
./scripts/vendor.sh prepare # Linux/macOS
.\scripts\vendor.ps1 prepare # Windows
This:
- Records submodule commit hashes in `vendor-lock.txt`
- Copies only the files needed for building (matching `Cargo.toml` `include` patterns)
- Deinitializes the git submodules
- Places the copied files in the vendor directories
Verify package contents
cargo package --list -p readstat-sys --allow-dirty
cargo package --list -p readstat-iconv-sys --allow-dirty
Publishing
Crates must be published in dependency order. Wait for each crate to appear on the crates.io index before publishing the next one.
# 1. No crate dependencies
cargo publish -p readstat-iconv-sys
# 2. Depends on readstat-iconv-sys (Windows only)
cargo publish -p readstat-sys
# 3. Depends on readstat-sys
cargo publish -p readstat
# 4. Depends on readstat
cargo publish -p readstat-cli
Note: There may be a delay (30 seconds to a few minutes) between publishing
a crate and it appearing in the index. If cargo publish fails with a dependency
resolution error, wait and retry.
Post-Publish
1. Restore submodules
./scripts/vendor.sh restore # Linux/macOS
.\scripts\vendor.ps1 restore # Windows
2. Create a git tag
git tag v0.20.0
git push origin v0.20.0
3. Create a GitHub release
Use the GitHub CLI or web UI to create a release from the tag. The CI pipeline
(main.yml) will automatically build platform binaries and attach them.
4. Clean up
- Remove `vendor-lock.txt` (or commit it for reference)
- Verify the published crates on crates.io
- Verify the docs on docs.rs
Troubleshooting
cargo publish fails with "no matching package found"
The dependency crate hasn't appeared in the index yet. Wait 30-60 seconds and retry.
cargo package includes too many files
Check the include field in the crate's Cargo.toml. Run cargo package --list
to see exactly what will be included.
Vendor files missing after vendor.sh restore
Run git submodule update --init --recursive to re-initialize.
Build fails after switching vendor modes
Clean the build cache: cargo clean then rebuild.
readstat
Pure Rust library for parsing SAS binary files (.sas7bdat) into Apache Arrow RecordBatch format. Uses FFI bindings to the ReadStat C library for parsing.
Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The
`readstat-sys` crate exposes the full ReadStat API – all 125 functions across all formats. However, this crate currently only implements parsing and conversion for SAS `.sas7bdat` files. SPSS and Stata formats are not supported.
Features
Output format writers are feature-gated (all enabled by default):
- `csv` – CSV output via `arrow-csv`
- `parquet` – Parquet output (Snappy, Zstd, Brotli, Gzip, Lz4 compression)
- `feather` – Arrow IPC / Feather format
- `ndjson` – Newline-delimited JSON
- `sql` – DataFusion SQL query support (optional, not enabled by default)
Key Types
- `ReadStatData` – Coordinates FFI parsing, accumulates values directly into typed Arrow builders
- `ReadStatMetadata` – File-level metadata (row/var counts, encoding, compression, schema)
- `ReadStatWriter` – Writes Arrow batches to the requested output format
- `ReadStatPath` – Validated input file path
- `WriteConfig` – Output configuration (path, format, compression)
For the full architecture overview, see docs/ARCHITECTURE.md.
readstat-cli
Binary crate producing the readstat CLI tool for converting SAS binary files (.sas7bdat) to other formats.
Note: The ReadStat C library supports SAS, SPSS, and Stata file formats. The
`readstat-sys` crate exposes the full ReadStat API – all 125 functions across all formats. However, this CLI currently only supports SAS `.sas7bdat` files. SPSS and Stata formats are not supported.
Subcommands
- `metadata` – Print file metadata (row/var counts, labels, encoding, format version, etc.)
- `preview` – Preview first N rows as CSV to stdout
- `data` – Convert to output format (csv, feather, ndjson, parquet)
Key Features
- Column selection (`--columns`, `--columns-file`)
- Streaming reads with configurable chunk size (`--stream-rows`)
- Parallel reading (`--parallel`) and parallel Parquet writing (`--parallel-write`)
- SQL queries via DataFusion (`--sql`, feature-gated)
- Parquet compression settings (`--compression`, `--compression-level`)
For the full CLI reference, see docs/USAGE.md.
readstat-sys
Raw FFI bindings to the ReadStat C library, generated with bindgen.
The build.rs script compiles ~49 C source files from the vendored vendor/ReadStat/ git submodule via the cc crate and generates Rust bindings with bindgen. Platform-specific linking for iconv and zlib is handled automatically (see docs/BUILDING.md for details).
These bindings expose the full ReadStat API – all 125 functions and all 8 enum types – including support for SAS (.sas7bdat, .xpt), SPSS (.sav, .zsav, .por), and Stata (.dta) file formats. If you need to work with SPSS or Stata files from Rust, this crate provides the complete FFI surface to do so.
This is a sys crate β it exposes raw C types and functions. The higher-level readstat library crate provides a safe API but currently only implements support for SAS .sas7bdat files.
API Coverage
All 125 public C functions and all 8 enum types from readstat.h are bound. All 49 library source files are compiled.
Functions by Category
| Category | Count | Formats |
|---|---|---|
| Metadata accessors | 15 | All |
| Value accessors | 14 | All |
| Variable accessors | 14 | All |
| Parser lifecycle | 3 | All |
| Parser callbacks | 7 | All |
| Parser I/O handlers | 6 | All |
| Parser config | 4 | All |
| File parsers (readers) | 10 | SAS (sas7bdat, sas7bcat, xport), SPSS (sav, por), Stata (dta), text schema (sas_commands, spss_commands, stata_dictionary, txt) |
| Schema parsing | 1 | All |
| Writer lifecycle | 3 | All |
| Writer label sets | 5 | All |
| Writer variable definition | 11 | All |
| Writer notes/strings | 3 | All |
| Writer metadata setters | 8 | All |
| Writer begin | 6 | SAS (sas7bdat, sas7bcat, xport), SPSS (sav, por), Stata (dta) |
| Writer validation | 2 | All |
| Writer row insertion | 12 | All |
| Error handling | 1 | All |
| **Total** | **125** | |
Compiled Source Files
| Directory | Files | Description |
|---|---|---|
| `src/` (core) | 11 | Hash table, parser, value/variable handling, writer, I/O, error |
| `src/sas/` | 11 | SAS7BDAT, SAS7BCAT, XPORT read/write, IEEE float, RLE compression |
| `src/spss/` | 16 | SAV, POR, ZSAV read/write, compression, SPSS parsing |
| `src/stata/` | 4 | DTA read/write, timestamp parsing |
| `src/txt/` | 7 | SAS commands, SPSS commands, Stata dictionary, plain text, schema |
| **Total** | **49** | |
Enum Types
| C Enum | Rust Type Alias | Description |
|---|---|---|
| `readstat_type_e` | `readstat_type_e` | Data types (string, int8/16/32, float, double, string_ref) |
| `readstat_type_class_e` | `readstat_type_class_e` | Type classes (string, numeric) |
| `readstat_measure_e` | `readstat_measure_e` | Measurement levels (nominal, ordinal, scale) |
| `readstat_alignment_e` | `readstat_alignment_e` | Column alignment (left, center, right) |
| `readstat_compress_e` | `readstat_compress_e` | Compression types (none, rows, binary) |
| `readstat_endian_e` | `readstat_endian_e` | Byte order (big, little) |
| `readstat_error_e` | `readstat_error_e` | Error codes (41 variants) |
| `readstat_io_flags_e` | `readstat_io_flags_e` | I/O flags |
Verifying Bindings
To confirm that the Rust bindings stay in sync with the vendored C header and source files, run the verification script:
# Bash (Linux, macOS, Windows Git Bash)
bash crates/readstat-sys/verify_bindings.sh
# Rebuild first, then verify
bash crates/readstat-sys/verify_bindings.sh --rebuild
# PowerShell (Windows)
.\crates\readstat-sys\verify_bindings.ps1
# Rebuild first, then verify
.\crates\readstat-sys\verify_bindings.ps1 -Rebuild
The script checks three things:
- Every function declared in `readstat.h` has a `pub fn` binding in the generated `bindings.rs`
- Every `typedef enum` in the header has a corresponding Rust type alias
- Every `.c` library source file in the vendor directory is listed in `build.rs`
Run this after updating the ReadStat submodule to catch any new or removed API surface.
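The first of those checks boils down to comparing the set of declared C functions against the set of generated bindings. The following is a simplified, hypothetical sketch of the idea using toy stand-in text (the real script handles the full header grammar):

```python
import re

# Toy stand-ins for readstat.h and the generated bindings.rs.
header = """
readstat_error_t readstat_parse_sas7bdat(readstat_parser_t *parser, const char *path, void *ctx);
int readstat_get_row_count(readstat_metadata_t *metadata);
"""
bindings = """
pub fn readstat_parse_sas7bdat(...);
pub fn readstat_get_row_count(...);
"""

# Functions declared in the header: a readstat_* identifier followed by '('.
declared = set(re.findall(r"\b(readstat_\w+)\s*\(", header))
# Functions bound in bindings.rs.
bound = set(re.findall(r"pub fn (\w+)", bindings))

missing = declared - bound
print(sorted(missing))  # [] when every declaration has a binding
```

The enum and source-file checks follow the same shape: extract names from one file, extract names from the other, and report the set difference.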
readstat-iconv-sys
Windows-only FFI bindings to libiconv for character encoding conversion.
The build.rs script compiles libiconv from the vendored vendor/libiconv-win-build/ git submodule using the cc crate. On non-Windows platforms the build script is a no-op.
The links = "iconv" key in Cargo.toml allows readstat-sys to discover the include path via the DEP_ICONV_INCLUDE environment variable.
readstat-tests
Integration test suite for the readstat library and readstat-cli binary.
Contains 29 test modules covering all SAS data types, 118 date/time/datetime formats, missing values, large pages, CLI subcommands, parallel read/write, Parquet output, CSV output, Arrow migration, row offsets, scientific notation, column selection, skip row count, memory-mapped file reading, byte-slice reading, and SQL queries.
Test data lives in tests/data/*.sas7bdat (14 datasets). SAS scripts to regenerate test data are in util/.
Run with:
cargo test -p readstat-tests
readstat-wasm
WebAssembly build of the readstat library for parsing SAS .sas7bdat files in JavaScript. Reads metadata and converts row data to CSV, NDJSON, Parquet, or Feather (Arrow IPC) entirely in memory – no server or native dependencies required at runtime.
Package contents
The pkg/ directory contains everything needed to use the library from JavaScript:
| File | Description |
|---|---|
| `readstat_wasm.wasm` | Pre-built WASM binary (Emscripten target) |
| `readstat_wasm.js` | JS wrapper handling module loading, memory management, and type conversion |
JS API
All functions accept a Uint8Array of raw .sas7bdat file bytes.
import { init, read_metadata, read_metadata_fast, read_data, read_data_ndjson, read_data_parquet, read_data_feather } from "readstat-wasm";
// Must be called once before using any other function
await init();
const bytes = new Uint8Array(/* .sas7bdat file contents */);
// Metadata (returns JSON string)
const metadataJson = read_metadata(bytes);
const metadataJsonFast = read_metadata_fast(bytes); // skips full row count
// Data as text (returns string)
const csv = read_data(bytes); // CSV with header row
const ndjson = read_data_ndjson(bytes); // newline-delimited JSON
// Data as binary (returns Uint8Array)
const parquet = read_data_parquet(bytes); // Parquet bytes
const feather = read_data_feather(bytes); // Feather (Arrow IPC) bytes
Functions
| Function | Returns | Description |
|---|---|---|
| `init()` | `Promise<void>` | Load and initialize the WASM module |
| `read_metadata(bytes)` | `string` | File and variable metadata as JSON |
| `read_metadata_fast(bytes)` | `string` | Same as above but skips full row count for speed |
| `read_data(bytes)` | `string` | All row data as CSV (with header) |
| `read_data_ndjson(bytes)` | `string` | All row data as newline-delimited JSON |
| `read_data_parquet(bytes)` | `Uint8Array` | All row data as Parquet bytes |
| `read_data_feather(bytes)` | `Uint8Array` | All row data as Feather (Arrow IPC) bytes |
How it works
The crate compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly using the wasm32-unknown-emscripten target. Emscripten is required because the underlying C code needs a C standard library (libc, iconv).
The data functions perform a two-pass parse over the byte buffer: first to extract metadata (schema, row count), then to read row values into an Arrow RecordBatch, which is serialized to CSV, NDJSON, Parquet, or Feather in memory.
C ABI exports
The WASM module exposes these C-compatible functions (used internally by the JS wrapper):
| Export | Signature | Purpose |
|---|---|---|
| `read_metadata` | `(ptr, len) -> *char` | Parse metadata as JSON |
| `read_metadata_fast` | `(ptr, len) -> *char` | Same, skipping full row count |
| `read_data` | `(ptr, len) -> *char` | Parse data, return as CSV |
| `read_data_ndjson` | `(ptr, len) -> *char` | Parse data, return as NDJSON |
| `read_data_parquet` | `(ptr, len, out_len) -> *u8` | Parse data, return as Parquet bytes |
| `read_data_feather` | `(ptr, len, out_len) -> *u8` | Parse data, return as Feather bytes |
| `free_string` | `(ptr)` | Free a string returned by the above |
| `free_binary` | `(ptr, len)` | Free a binary buffer returned by parquet/feather |
Building from source
Requires Rust, Emscripten SDK, and libclang.
# Activate Emscripten
source /path/to/emsdk/emsdk_env.sh
# Add the target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only, from repo root)
git submodule update --init --recursive
# Build
cargo build --target wasm32-unknown-emscripten --release
# Copy binary to pkg/
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
See the bun-demo for a working example.
readstat CLI Demo
Demonstrates converting a SAS .sas7bdat file to CSV, NDJSON, Parquet, and Feather using the readstat command-line tool.
Quick start
Linux / macOS
# Build the CLI (from repo root)
cargo build -p readstat-cli
# Run the conversion script
cd examples/cli-demo
bash convert.sh
# Verify the output files
uv run verify_output.py
You can also pass a specific path to the readstat binary:
bash convert.sh /path/to/readstat
Windows (PowerShell)
# Build the CLI (from repo root)
cargo build -p readstat-cli
# Run the conversion script
cd examples/cli-demo
./convert.ps1
# Verify the output files
uv run verify_output.py
You can also pass a specific path to the readstat binary:
./convert.ps1 -ReadStat C:\path\to\readstat.exe
What it does
The convert.sh (Bash) and convert.ps1 (PowerShell) scripts:
- Display metadata for the `cars.sas7bdat` dataset (table name, encoding, row count, variable info)
- Preview the first 5 rows of data
- Convert the dataset to four output formats:
  - `cars.csv` – comma-separated values
  - `cars.ndjson` – newline-delimited JSON
  - `cars.parquet` – Apache Parquet (columnar binary)
  - `cars.feather` – Arrow IPC / Feather (columnar binary)
The verify_output.py script validates all output files:
- Checks row and column counts match the expected 1,081 rows x 13 columns
- Verifies column names are correct
- Confirms cross-format consistency (all four formats contain identical data)
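A cross-format consistency check amounts to parsing each output into a common row representation and comparing. A minimal illustration of the idea, using only the stdlib `csv` and `json` modules on a one-row sample of the cars dataset (not the actual verify_output.py code):

```python
import csv
import io
import json

# One row of the cars dataset in two of the output formats.
csv_text = "Brand,Model,CityMPG\nTOYOTA,Prius,60.0\n"
ndjson_text = '{"Brand":"TOYOTA","Model":"Prius","CityMPG":60.0}\n'


def norm(value):
    """Coerce numeric strings to float so CSV and JSON rows compare equal."""
    try:
        return float(value)
    except ValueError:
        return value


csv_rows = [
    {k: norm(v) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]
ndjson_rows = [json.loads(line) for line in ndjson_text.splitlines()]

assert csv_rows == ndjson_rows  # the two formats agree on the sample row
```

The real script performs the same comparison across all four formats and all 1,081 rows.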
The cars dataset
| Property | Value |
|---|---|
| Rows | 1,081 |
| Columns | 13 |
| Source | crates/readstat-tests/tests/data/cars.sas7bdat |
| Encoding | WINDOWS-1252 |
Columns: Brand, Model, Minivan, Wagon, Pickup, Automatic, EngineSize, Cylinders, CityMPG, HwyMPG, SUV, AWD, Hybrid
Expected output
Using readstat: /path/to/readstat
Input file: /path/to/cars.sas7bdat
=== Metadata ===
...
=== Preview (first 5 rows) ===
...
Converting to CSV...
-> cars.csv
Converting to NDJSON...
-> cars.ndjson
Converting to Parquet...
-> cars.parquet
Converting to Feather...
-> cars.feather
Done! All output files written to /path/to/examples/cli-demo
Run 'uv run verify_output.py' to validate the output files.
API Server Demo
Two identical API servers demonstrating how to integrate readstat into backend applications:
- Rust server (Axum) – direct library integration
- Python server (FastAPI) – cross-language integration via PyO3/maturin bindings
Both servers expose the same endpoints and return identical results for the same input.
Prerequisites
Rust server:
- Rust toolchain
- `libclang` (for readstat-sys bindgen)
- Git submodules initialized: `git submodule update --init --recursive`
Python server:
- Everything above, plus:
- uv (Python package manager)
- Python 3.9+
Quick Start
Rust Server (port 3000)
cd examples/api-demo/rust-server
cargo run
You should see:
Rust API server listening on http://localhost:3000
Python Server (port 3001)
cd examples/api-demo/python-server
# Build the PyO3 bindings into the project venv
uv sync
uv run maturin develop -m readstat_py/Cargo.toml
# Start the server
uv run uvicorn server:app --port 3001
You should see:
INFO: Started server process [...]
INFO: Uvicorn running on http://127.0.0.1:3001 (Press CTRL+C to quit)
Walking Through the Endpoints
The examples below use port 3000 (Rust server). Replace with 3001 for the Python server – the responses are identical.
Set a convenience variable for the test file:
FILE=test-data/cars.sas7bdat
1. Health Check
curl http://localhost:3000/health
Expected output:
{"status":"ok"}
2. File Metadata
Upload a SAS file and get back its metadata as JSON:
curl -F "file=@$FILE" http://localhost:3000/metadata
Expected output (formatted):
{
"row_count": 1081,
"var_count": 13,
"table_name": "CARS",
"file_label": "Written by SAS",
"file_encoding": "WINDOWS-1252",
"version": 9,
"is64bit": 0,
"creation_time": "2008-09-30 12:55:01",
"modified_time": "2008-09-30 12:55:01",
"compression": "None",
"endianness": "Little",
"vars": {
"0": {
"var_name": "Brand",
"var_type": "String",
"var_type_class": "String",
"var_label": "",
"var_format": "",
"var_format_class": null,
"storage_width": 13,
"display_width": 0
},
"1": {
"var_name": "Model",
"var_type": "String",
"var_type_class": "String",
...
},
...
}
}
The `vars` map is keyed by column index and includes type info, labels, and SAS format metadata for all 13 variables.
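Because the keys are stringified column indices, client code should sort them numerically to recover column order. A small sketch over an excerpt of the metadata response above:

```python
import json

# Excerpt of the /metadata response (first two variables only).
metadata = json.loads("""
{
  "row_count": 1081,
  "var_count": 13,
  "vars": {
    "0": {"var_name": "Brand", "var_type": "String"},
    "1": {"var_name": "Model", "var_type": "String"}
  }
}
""")

# Keys are strings ("0", "1", ..., "12"), so sort numerically, not lexically,
# to avoid the "10" < "2" ordering problem on wider tables.
ordered = sorted(metadata["vars"].items(), key=lambda kv: int(kv[0]))
columns = [v["var_name"] for _, v in ordered]
print(columns)  # ['Brand', 'Model']
```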
3. Preview Rows
Get the first N rows as CSV (default 10, here we ask for 5):
curl -F "file=@$FILE" "http://localhost:3000/preview?rows=5"
Expected output:
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,47.0,48.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,46.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,45.0,51.0,0.0,0.0,1.0
4. Convert to CSV
Export the full dataset (all 1,081 rows) as CSV:
curl -F "file=@$FILE" "http://localhost:3000/data?format=csv" -o output.csv
The response has Content-Type: text/csv and Content-Disposition: attachment; filename="data.csv".
5. Convert to NDJSON
Export as newline-delimited JSON (one JSON object per row):
curl -F "file=@$FILE" "http://localhost:3000/data?format=ndjson"
Expected output (first few lines):
{"Brand":"TOYOTA","Model":"Prius","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.5,"Cylinders":4.0,"CityMPG":60.0,"HwyMPG":51.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
{"Brand":"HONDA","Model":"Civic Hybrid","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.3,"Cylinders":4.0,"CityMPG":48.0,"HwyMPG":47.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
{"Brand":"HONDA","Model":"Civic Hybrid","Minivan":0.0,"Wagon":0.0,"Pickup":0.0,"Automatic":1.0,"EngineSize":1.3,"Cylinders":4.0,"CityMPG":47.0,"HwyMPG":48.0,"SUV":0.0,"AWD":0.0,"Hybrid":1.0}
...
The response has Content-Type: application/x-ndjson.
6. Convert to Parquet
Export as Apache Parquet (binary, Snappy-compressed):
curl -F "file=@$FILE" "http://localhost:3000/data?format=parquet" -o output.parquet
This produces a ~15 KB Parquet file. You can inspect it with tools like parquet-tools, DuckDB, or pandas:
import pandas as pd
print(pd.read_parquet("output.parquet").head())
7. Convert to Feather
Export as Arrow IPC (Feather v2) format:
curl -F "file=@$FILE" "http://localhost:3000/data?format=feather" -o output.feather
This produces a ~130 KB Feather file. Read it back with any Arrow-compatible tool:
import pandas as pd
print(pd.read_feather("output.feather").head())
Automated Test Scripts
Both scripts work against either server – just change the URL.
Shell script (curl)
cd examples/api-demo
bash client/test_api.sh http://localhost:3000 test-data/cars.sas7bdat
bash client/test_api.sh http://localhost:3001 test-data/cars.sas7bdat
Python script (httpx)
Uses PEP 723 inline script metadata, so `uv run` handles dependencies automatically – no virtual environment setup needed:
cd examples/api-demo/client
uv run test_api.py http://localhost:3000 ../test-data/cars.sas7bdat
uv run test_api.py http://localhost:3001 ../test-data/cars.sas7bdat
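A PEP 723 header is just a specially formatted comment block at the top of the script. This minimal standalone example shows the format (the dependency list here is illustrative, not the demo's actual metadata):

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "httpx",
# ]
# ///

# `uv run` reads the comment block above, creates an ephemeral environment
# with the listed dependencies installed, and then runs the script body.
import sys

print(f"running under Python {sys.version_info.major}.{sys.version_info.minor}")
```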
Expected output:
=== Testing http://localhost:3000 with ../test-data/cars.sas7bdat ===
--- GET /health ---
{'status': 'ok'}
--- POST /metadata ---
row_count: 1081
var_count: 13
table_name: CARS
encoding: WINDOWS-1252
variables: 13
--- POST /preview (5 rows) ---
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
...
--- POST /data?format=csv ---
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
--- POST /data?format=ndjson ---
{"Brand":"TOYOTA","Model":"Prius","Minivan":0.0,...}
...
--- POST /data?format=parquet ---
15403 bytes
--- POST /data?format=feather ---
129650 bytes
=== All tests passed ===
API Reference
| Method | Path | Request | Response | Content-Type |
|---|---|---|---|---|
| GET | `/health` | – | `{"status": "ok"}` | `application/json` |
| POST | `/metadata` | multipart file | JSON metadata | `application/json` |
| POST | `/preview?rows=N` | multipart file | CSV text (first N rows, default 10) | `text/csv` |
| POST | `/data?format=csv` | multipart file | Full dataset as CSV | `text/csv` |
| POST | `/data?format=ndjson` | multipart file | Full dataset as NDJSON | `application/x-ndjson` |
| POST | `/data?format=parquet` | multipart file | Full dataset as Parquet | `application/octet-stream` |
| POST | `/data?format=feather` | multipart file | Full dataset as Feather | `application/octet-stream` |
The multipart field name must be file. Binary formats include a Content-Disposition header with a suggested filename.
How It Works
Rust Server
HTTP upload → Axum multipart extraction → Vec<u8>
  → spawn_blocking {
        ReadStatMetadata::read_metadata_from_bytes()
        ReadStatData::read_data_from_bytes() → Arrow RecordBatch
        write_batch_to_{csv,ndjson,parquet,feather}_bytes()
    }
  → HTTP response
All ReadStat C library FFI calls run inside spawn_blocking to avoid blocking the tokio async runtime.
Python Server
HTTP upload → FastAPI UploadFile → bytes
  → readstat_py.read_to_{csv,ndjson,parquet,feather}(bytes)
    → [PyO3 boundary]
      → ReadStatMetadata::read_metadata_from_bytes()
      → ReadStatData::read_data_from_bytes() → Arrow RecordBatch
      → write_batch_to_*_bytes()
    → [back to Python]
  → HTTP response
The PyO3 binding layer is intentionally thin – 5 functions that take `&[u8]` and return `Vec<u8>` (or `String` for metadata). No complex types cross the FFI boundary.
readstat-wasm Bun Demo
Demonstrates reading SAS .sas7bdat file metadata and data from JavaScript using the readstat-wasm package compiled to WebAssembly via Emscripten. The demo parses a .sas7bdat file entirely in-memory via WASM and converts it to CSV.
Quick start
If you already have Rust, Emscripten SDK, libclang, and Bun installed:
macOS / Linux:
# Activate Emscripten (first time per terminal session)
source /path/to/emsdk/emsdk_env.sh
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates/readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Run the demo
cd ../../examples/bun-demo
bun install
bun run index.ts
Windows (Git Bash):
# Activate Emscripten (first time per terminal session)
/c/path/to/emsdk/emsdk.bat activate latest
export EMSDK=C:/path/to/emsdk
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates/readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Run the demo
cd ../../examples/bun-demo
bun install
bun run index.ts
Windows (PowerShell):
# Activate Emscripten (first time per terminal session)
C:\path\to\emsdk\emsdk.bat activate latest
$env:EMSDK = "C:\path\to\emsdk"
# Add the wasm target (first time only)
rustup target add wasm32-unknown-emscripten
# Initialize submodules (first time only)
git submodule update --init --recursive
# Build the wasm package
cd crates\readstat-wasm
cargo build --target wasm32-unknown-emscripten --release
copy target\wasm32-unknown-emscripten\release\readstat_wasm.wasm pkg\
# Run the demo
cd ..\..\examples\bun-demo
bun install
bun run index.ts
1. Install dependencies
Rust + wasm target
# Install Rust (if not already installed)
# macOS / Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Windows β download and run rustup-init.exe from https://rustup.rs
# Add the Emscripten wasm target (all platforms)
rustup target add wasm32-unknown-emscripten
Emscripten SDK
# Clone the SDK
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
# Install and activate the latest toolchain
./emsdk install latest
./emsdk activate latest
Activate in your shell (run every new terminal session, or add to your profile):
| Platform | Command |
|---|---|
| macOS / Linux | source ./emsdk_env.sh |
| Windows (cmd) | emsdk_env.bat |
| Windows (PowerShell) | emsdk_env.bat (then set $env:EMSDK = "C:\path\to\emsdk" if needed) |
| Windows (Git Bash) | source ./emsdk_env.sh (then export EMSDK=C:/path/to/emsdk if needed) |
Note: On Windows,
`emsdk_env.sh`/`emsdk_env.bat` may update PATH without exporting the `EMSDK` variable. If the build fails with "EMSDK must be set", set it manually as shown above. The build script will also attempt to auto-detect the emsdk root from PATH.
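That auto-detection amounts to scanning PATH entries for an emsdk install. The following is a simplified, hypothetical sketch of the idea (the `find_emsdk_root` helper and its heuristic are illustrative, not the build script's actual logic):

```python
import os


def find_emsdk_root(path_var):
    """Scan a PATH-style string for an entry under an emsdk checkout
    and return the emsdk root directory, or None if not found."""
    for entry in path_var.split(os.pathsep):
        parts = entry.replace("\\", "/").rstrip("/").split("/")
        if "emsdk" in parts:
            # Truncate the entry at the emsdk directory itself, so
            # ".../emsdk/upstream/emscripten" also resolves to the root.
            root_index = parts.index("emsdk")
            return "/".join(parts[: root_index + 1])
    return None


# Example: a typical PATH after `source emsdk_env.sh`.
sample = os.pathsep.join(
    ["/home/me/emsdk", "/home/me/emsdk/upstream/emscripten", "/usr/bin"]
)
print(find_emsdk_root(sample))  # /home/me/emsdk
```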
libclang (required by bindgen)
| Platform | Command |
|---|---|
| macOS | brew install llvm |
| Ubuntu / Debian | sudo apt-get install libclang-dev |
| Fedora | sudo dnf install clang-devel |
| Windows | Install LLVM from https://releases.llvm.org/download.html and set LIBCLANG_PATH to the lib directory (e.g., C:\Program Files\LLVM\lib) |
Bun
# macOS / Linux
curl -fsSL https://bun.sh/install | bash
# Windows (PowerShell)
powershell -c "irm bun.sh/install.ps1 | iex"
2. Initialize git submodules
From the repository root:
git submodule update --init --recursive
3. Build the WASM package
# Make sure Emscripten is activated in your shell (see table above)
# From the readstat-wasm crate directory
cd crates/readstat-wasm
# Build with Emscripten target (release mode)
cargo build --target wasm32-unknown-emscripten --release
# Copy the .wasm binary into the pkg/ directory
# macOS / Linux
cp target/wasm32-unknown-emscripten/release/readstat_wasm.wasm pkg/
# Windows (PowerShell)
# copy target\wasm32-unknown-emscripten\release\readstat_wasm.wasm pkg\
4. Run the demo
cd examples/bun-demo
bun install
bun run index.ts
Expected output
=== SAS7BDAT Metadata ===
Table name: CARS
File encoding: WINDOWS-1252
Row count: 1081
Variable count: 13
Compression: None
Endianness: Little
Created: 2008-09-30 12:55:01
Modified: 2008-09-30 12:55:01
=== Variables ===
[0] Brand (String, )
[1] Model (String, )
[2] Minivan (Double, )
[3] Wagon (Double, )
[4] Pickup (Double, )
[5] Automatic (Double, )
[6] EngineSize (Double, )
[7] Cylinders (Double, )
[8] CityMPG (Double, )
[9] HwyMPG (Double, )
[10] SUV (Double, )
[11] AWD (Double, )
[12] Hybrid (Double, )
=== CSV Data (preview) ===
Brand,Model,Minivan,Wagon,Pickup,Automatic,EngineSize,Cylinders,CityMPG,HwyMPG,SUV,AWD,Hybrid
TOYOTA,Prius,0.0,0.0,0.0,1.0,1.5,4.0,60.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,48.0,47.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,1.0,1.3,4.0,47.0,48.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,46.0,51.0,0.0,0.0,1.0
HONDA,Civic Hybrid,0.0,0.0,0.0,0.0,1.3,4.0,45.0,51.0,0.0,0.0,1.0
... (1081 total data rows)
Wrote 1081 rows to cars.csv
How it works
The readstat-wasm crate compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly using the wasm32-unknown-emscripten target. Emscripten is required because the underlying ReadStat C code needs a C standard library (libc, iconv) – which Emscripten provides for wasm. (Note: zlib is only needed for SPSS zsav support, which is not included in the current wasm build.)
The crate exports eight C-compatible functions:
| Export | Signature | Purpose |
|---|---|---|
| `read_metadata` | `(ptr, len) -> *char` | Parse metadata as JSON from a byte buffer |
| `read_metadata_fast` | `(ptr, len) -> *char` | Same, but skips full row count |
| `read_data` | `(ptr, len) -> *char` | Parse data and return as CSV string |
| `read_data_ndjson` | `(ptr, len) -> *char` | Parse data and return as NDJSON string |
| `read_data_parquet` | `(ptr, len, out_len) -> *u8` | Parse data and return as Parquet bytes |
| `read_data_feather` | `(ptr, len, out_len) -> *u8` | Parse data and return as Feather bytes |
| `free_string` | `(ptr)` | Free a string returned by the string functions |
| `free_binary` | `(ptr, len)` | Free binary data returned by parquet/feather |
The data functions perform a two-pass parse over the same byte buffer: first to extract metadata (schema, row count), then to read row values into an Arrow RecordBatch, which is serialized to CSV or NDJSON in memory.
The JS wrapper in pkg/readstat_wasm.js handles:
- Loading the `.wasm` module
- Providing minimal WASI and Emscripten import stubs
- Memory management (malloc/free for input bytes, free_string for output)
- Converting between JS types and wasm pointers
Troubleshooting
EMSDK must be set for Emscripten builds
Set the EMSDK environment variable to point to your emsdk installation directory. On macOS/Linux: export EMSDK=/path/to/emsdk. On Windows (PowerShell): $env:EMSDK = "C:\path\to\emsdk". On Windows (Git Bash): export EMSDK=C:/path/to/emsdk. The build script also attempts to auto-detect the emsdk root from your PATH, so simply having Emscripten activated may be sufficient.
error: linking with emcc failed / undefined symbol: main
Make sure you're building from crates/readstat-wasm/ (not the repo root). The .cargo/config.toml in that directory provides the necessary linker flags.
The command line is too long (Windows)
This was a known issue when building all ReadStat C source files for the Emscripten target. It has been fixed – the build script now compiles only the SAS format sources for Emscripten builds, keeping the archiver command within Windows' command-line length limit.
Web Demo: SAS7BDAT Viewer & Converter
Browser-based demo that reads SAS .sas7bdat files entirely client-side using WebAssembly. Upload a file to view metadata, preview data in a sortable table, and export to CSV, NDJSON, Parquet, or Feather.
No build tools, no npm install, no framework – just static files served over HTTP.
Quick start
1. Copy the WASM binary into this directory (if not already present):

   cp crates/readstat-wasm/pkg/readstat_wasm.wasm examples/web-demo/

   If you need to rebuild it first, see the bun-demo README for build instructions.

2. Serve the directory with any static HTTP server. You must point the server at the directory, not at `index.html` directly:

   # From the repo root:
   python -m http.server 8000 -d examples/web-demo
   npx serve examples/web-demo
   bunx serve examples/web-demo

   # Or from the web-demo directory:
   cd examples/web-demo
   python -m http.server 8000
   npx serve
   bunx serve

   Note: Do not pass `index.html` as the argument (e.g., `bunx serve index.html`). That tells `serve` to look for a directory named `index.html`, which will cause the WASM and JS files to 404.

3. Open `http://localhost:3000` (for `serve`) or `http://localhost:8000` (for Python) in your browser.

4. Upload a `.sas7bdat` file (e.g., `crates/readstat-tests/tests/data/cars.sas7bdat`).
Features
- Metadata panel β table name, encoding, row/variable count, compression, timestamps
- Variable table β name, type, label, and format for each column
- Data preview β first 100 rows in a sortable table (uses Tabulator from CDN, with plain HTML table fallback)
- Export β download as CSV, NDJSON, Parquet, or Feather
WASM binary
The readstat_wasm.wasm file is built from the readstat-wasm crate (crates/readstat-wasm/). It compiles the ReadStat C library and the Rust readstat parsing library to WebAssembly via the wasm32-unknown-emscripten target. The binary is ~9.7 MB.
A pre-built copy is checked in at crates/readstat-wasm/pkg/readstat_wasm.wasm.
Browser compatibility
- Requires a modern browser with WebAssembly support (Chrome 57+, Firefox 52+, Safari 11+, Edge 16+)
- Must be served over HTTP(S) – `file://` URLs will not work due to WASM `fetch()` requirements
- Tabulator.js is loaded from CDN; if offline, the data preview falls back to a plain HTML table
File structure
examples/web-demo/
βββ index.html # App (HTML + inline CSS + inline JS)
βββ readstat_wasm.js # Browser-compatible WASM wrapper
βββ readstat_wasm.wasm # WASM binary (copied from pkg/)
βββ README.md # This file
SAS7BDAT SQL Explorer
An interactive browser-based tool for uploading .sas7bdat files and querying them with SQL – entirely client-side using WebAssembly.
How It Works
- Upload a `.sas7bdat` file (drag-and-drop or file picker)
- The file is parsed in-browser via the `readstat-wasm` WebAssembly module
- Data is loaded into AlaSQL, a client-side SQL engine
- Write SQL queries in a syntax-highlighted editor (powered by CodeMirror 6)
- View results in an interactive, sortable table (powered by Tabulator)
- Export query results as CSV
No data leaves your browser – all processing happens locally.
Quick Start
Serve the directory with any static HTTP server. The entire directory must be served (not just index.html) so the browser can load the .js and .wasm files alongside it.
From the repository root:
# Python
python -m http.server 8000 -d examples/sql-explorer
# Bun
bunx serve examples/sql-explorer
Or cd into the directory and serve from there:
cd examples/sql-explorer
# Python
python -m http.server 8000
# Bun
bunx serve .
Then open http://localhost:8000 in your browser.
Note: The page must be served over HTTP(S) – opening
`index.html` directly as a `file://` URL won't work because browsers block WASM loading from the local filesystem.
WASM Files
The readstat_wasm.js and readstat_wasm.wasm files are copies from examples/web-demo/. If you rebuild the WASM module, copy the updated files here as well.
To rebuild from source (requires Emscripten):
cd crates/readstat-wasm
./build.sh
cp pkg/readstat_wasm.js pkg/readstat_wasm.wasm ../../examples/sql-explorer/
CDN Dependencies
All loaded automatically from CDNs – no npm install required:
| Library | Version | CDN | Purpose |
|---|---|---|---|
| AlaSQL | 4.x | jsdelivr | Client-side SQL engine |
| CodeMirror 6 | 6.x | esm.sh | SQL editor with syntax highlighting |
| Tabulator | 6.x | unpkg | Interactive sortable/filterable result tables |
Example Queries
Once a file is loaded, the data is available as a table named data. Some queries to try:
-- Preview all rows
SELECT * FROM data LIMIT 100
-- Count rows
SELECT COUNT(*) AS total_rows FROM data
-- Filter rows
SELECT * FROM data WHERE column_name = 'value'
-- Aggregate
SELECT column_name, COUNT(*) AS n FROM data GROUP BY column_name ORDER BY n DESC
-- Select specific columns
SELECT col1, col2, col3 FROM data LIMIT 50
Column names with spaces or special characters should be wrapped in square brackets: [Column Name].
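A tiny helper can apply that quoting automatically when building queries. This is an illustrative sketch, not part of the demo; the `quote_ident` helper and the sample column names "City MPG" and "2024_sales" are hypothetical:

```python
import re


def quote_ident(name):
    """Wrap a column name in [brackets] unless it is already a plain
    identifier (letters/underscores/digits, not starting with a digit)."""
    if re.fullmatch(r"[A-Za-z_]\w*", name):
        return name
    return f"[{name}]"


cols = ["Brand", "City MPG", "2024_sales"]
select = "SELECT " + ", ".join(quote_ident(c) for c in cols) + " FROM data"
print(select)  # SELECT Brand, [City MPG], [2024_sales] FROM data
```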
For the full list of supported SQL syntax, see the AlaSQL SQL Reference.
API Documentation (Rustdocs)
Auto-generated API documentation for each crate is available below:
- readstat – Library crate for parsing SAS files into Arrow
- readstat_cli – CLI binary
- readstat_sys – Raw FFI bindings to the ReadStat C library
- readstat_iconv_sys – Windows-only iconv FFI bindings
- readstat_tests – Integration test suite
Note: These docs are generated by
`cargo doc` and deployed alongside this book by CI.