Usage
After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:
metadata → writes the following to standard out or as JSON:
- row count
- variable count
- table name
- table label
- file encoding
- format version
- bitness
- creation time
- modified time
- compression
- byte order
- variable names
- variable type classes
- variable types
- variable labels
- variable format classes
- variable formats
- arrow data types
preview → writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in csv format to standard out
data → writes parsed data in csv, feather, ndjson, or parquet format to a file
Metadata
To write metadata to standard out, invoke the following.
readstat metadata /some/dir/to/example.sas7bdat
To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.
readstat metadata /some/dir/to/example.sas7bdat --as-json
The JSON output contains file-level metadata and a vars object keyed by variable index. This makes it straightforward to search for a particular column by piping the output to jq or Python.
Search for a column with jq
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | select(.value.var_name == "Make") | .value'
Search for a column with Python
# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
match = [v for v in md['vars'].values() if v['var_name'] == 'Make']
if match:
    print(json.dumps(match[0], indent=2))
"
Preview Data
To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).
readstat preview /some/dir/to/example.sas7bdat
To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.
readstat preview /some/dir/to/example.sas7bdat --rows 100
Data
📝 The data subcommand includes a parameter for --format, which specifies the file format to be written. Currently, the following formats have been implemented:
- csv
- feather
- ndjson
- parquet
csv
To write parsed data (as csv) to a file, invoke the following (default is to write all parsed data to the specified file).
The default --format is csv; thus, the parameter is omitted from the examples below.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv
To write the first 100 rows of parsed data (as csv) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100
feather
To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather
To write the first 100 rows of parsed data (as feather) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100
ndjson
To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson
To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100
parquet
To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet
To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100
To write parsed data (as parquet) to a file with specific compression settings, invoke the following:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3
Column Selection
Select specific columns to include when converting or previewing data.
Step 1: View available columns
readstat metadata /some/dir/to/example.sas7bdat
Or as JSON for programmatic use with jq:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| jq '.vars | to_entries[] | .value.var_name'
Or with Python:
readstat metadata /some/dir/to/example.sas7bdat --as-json \
| python -c "
import json, sys
md = json.load(sys.stdin)
for v in md['vars'].values():
    print(v['var_name'])
"
Step 2: Select columns on the command line
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns Brand,Model,EngineSize
Step 2 (alt): Select columns from a file
Create columns.txt:
# Columns to extract from the dataset
Brand
Model
EngineSize
Then pass it to the CLI:
readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns-file columns.txt
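The columns file shown above is one name per line, with # comments. As an illustration only, the following Python sketch parses that shape; the exact comment and whitespace handling is an assumption, not the CLI's documented behavior.

```python
def parse_columns_file(text: str) -> list[str]:
    """Parse a columns file: one column name per line;
    blank lines and '#' comment lines are skipped (assumed format)."""
    columns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        columns.append(line)
    return columns

sample = """\
# Columns to extract from the dataset
Brand
Model
EngineSize
"""
print(parse_columns_file(sample))  # ['Brand', 'Model', 'EngineSize']
```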
Preview with column selection
readstat preview /some/dir/to/example.sas7bdat --columns Brand,Model,EngineSize
Parallelism
The data subcommand includes parameters for both parallel reading and parallel writing:
Parallel Reading (--parallel)
If invoked, the reading of a sas7bdat will occur in parallel. If the total number of rows to process is greater than stream-rows (which defaults to 10,000 when unset), then each chunk of rows is read in parallel. Note that the --parallel option uses all processors on the user's machine; allowing the user to throttle this number may be considered in the future.
❗ Utilizing the --parallel parameter will increase memory usage — all chunks are read in parallel and collected in memory before being sent to the writer. In addition, because all processors are utilized, CPU usage may be maxed out during reading. Row ordering from the original sas7bdat is preserved.
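As an illustration of the chunking scheme (a stdlib Python sketch, not the actual Rust implementation), the following splits a row range into stream-rows-sized chunks, reads them concurrently, and preserves the original row order:

```python
from concurrent.futures import ThreadPoolExecutor

STREAM_ROWS = 10_000  # default chunk size; --stream-rows overrides it

def chunk_offsets(total_rows: int, stream_rows: int = STREAM_ROWS):
    """Split [0, total_rows) into (start, count) chunks of stream_rows."""
    return [(start, min(stream_rows, total_rows - start))
            for start in range(0, total_rows, stream_rows)]

def read_chunk(offset_count):
    start, count = offset_count
    # Stand-in for parsing rows [start, start + count) of the sas7bdat.
    return list(range(start, start + count))

def read_parallel(total_rows: int) -> list[int]:
    # executor.map yields results in input order, so the row order of
    # the original file is kept even though chunks run concurrently.
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(read_chunk, chunk_offsets(total_rows)))
    return [row for chunk in chunks for row in chunk]

rows = read_parallel(25_000)
print(len(rows), rows[:3], rows[-1])  # 25000 [0, 1, 2] 24999
```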
Parallel Writing (--parallel-write)
When combined with --parallel, the --parallel-write flag enables parallel writing for Parquet format files. This can significantly improve write performance for large datasets by:
- Writing record batches to temporary files in parallel using all available processors
- Merging the temporary files into the final output
- Using spooled temporary files that keep data in memory until a threshold is reached
Note: Parallel writing currently only supports the Parquet format. Other formats (CSV, Feather, NDJSON) will use optimized sequential writes with BufWriter.
Example usage:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write
Memory Buffer Size (--parallel-write-buffer-mb)
Controls the memory buffer size (in MB) before spilling to disk during parallel writes. Defaults to 100 MB. Valid range: 1-10240 MB.
Smaller buffers will cause data to spill to disk sooner, while larger buffers keep more data in memory. Choose based on your available memory and dataset size:
- Small datasets (< 100 MB): Use default or larger buffer to keep everything in memory
- Large datasets (> 1 GB): Consider smaller buffer (10-50 MB) to manage memory usage
- Memory-constrained systems: Use smaller buffer (1-10 MB)
Example with custom buffer size:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write --parallel-write-buffer-mb 200
❗ Parallel writing may write batches out of order. This is acceptable for Parquet files as the row order is preserved when merged.
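The spill-to-disk behavior described above can be demonstrated with Python's stdlib tempfile.SpooledTemporaryFile, which implements the same idea: data stays in memory until a size threshold, then rolls over to a real on-disk file. The _rolled attribute inspected here is a CPython internal, used only to observe the rollover; the 1 MiB threshold is for the demo (the CLI defaults to 100 MB).

```python
import tempfile

# Demo with a 1 MiB threshold; --parallel-write-buffer-mb defaults to 100.
buf = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)
buf.write(b"x" * 1024)                 # 1 KiB: still held in memory
in_memory = not buf._rolled            # _rolled is a CPython internal
buf.write(b"x" * (2 * 1024 * 1024))    # crosses the threshold
spilled = buf._rolled                  # now backed by an on-disk file
buf.close()
print(in_memory, spilled)  # True True
```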
Memory Considerations
Default: Sequential Writes
In the default sequential write mode, a bounded channel (capacity 10) connects the reader thread to the writer. This means at most 10 chunks (each containing up to stream-rows rows) are held in memory at any time, providing natural backpressure when the writer is slower than the reader. For most workloads this keeps memory usage reasonable, but for very wide datasets (hundreds of columns, string-heavy) each chunk can be large — consider lowering --stream-rows if memory is a concern.
Sequential Write (default)
==========================
Reader Thread Bounded Channel (cap 10) Main Thread
+---------------------+ +------------------------+ +---------------------+
| | | | | |
| +-----------+ | send | +--+--+--+--+--+--+ | recv | +-------+ |
| | chunk 1 |-------|------>| | | | | | | | |------>| | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+ | | +-------+ |
| +-----------+ | send | channel is full! | | |
| | chunk 2 |-------|------>| +--+--+--+--+--+--+--+| | +-------+ |
| +-----------+ | | | | | | | | | || | | write |---> file |
| +-----------+ | | +--+--+--+--+--+--+--+| | +-------+ |
| | chunk 3 |-------|-XXXXX | | | |
| +-----------+ | BLOCK | writer drains a slot | | +-------+ |
| ... waits ... | | +--+--+--+--+--+--+ | | | write |---> file |
| | chunk 3 |-------|------>| | | | | | | | | | +-------+ |
| +-----------+ | ok! | +--+--+--+--+--+--+ | | |
| | | | | |
+---------------------+ +------------------------+ +---------------------+
Memory at any moment: <= 10 chunks in the channel + 1 being written
Backpressure: reader blocks when channel is full
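The bounded-channel backpressure in the diagram can be mimicked with a stdlib queue.Queue (a stand-in for the Rust channel, not the actual implementation): put blocks once the queue holds 10 chunks, so the reader naturally waits for the writer to drain a slot.

```python
import queue
import threading

CAPACITY = 10  # matches the bounded channel's capacity
SENTINEL = None

def reader(chan: queue.Queue, n_chunks: int) -> None:
    for i in range(n_chunks):
        chan.put(f"chunk {i}")   # blocks while the channel is full
    chan.put(SENTINEL)           # signal end of data

def writer(chan: queue.Queue, out: list) -> None:
    while (chunk := chan.get()) is not SENTINEL:
        out.append(chunk)        # stand-in for writing to the file

chan = queue.Queue(maxsize=CAPACITY)
written: list[str] = []
t = threading.Thread(target=reader, args=(chan, 25))
t.start()
writer(chan, written)
t.join()
print(len(written), written[0], written[-1])  # 25 chunk 0 chunk 24
```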
Parallel Writes (--parallel-write)
📝 --parallel-write: Uses bounded-batch processing — batches are pulled from the channel in groups (up to 10 at a time), written in parallel to temporary Parquet files, then the next group is pulled. This preserves the channel’s backpressure so that memory usage stays bounded rather than loading the entire dataset at once. All temporary files are merged into the final output at the end.
Parallel Write (--parallel --parallel-write)
============================================
Reader Thread Bounded Channel (cap 10) Main Thread
+------------------+ +------------------------+ +-------------------------+
| | | | | |
| +----------+ | send | | recv | Pull <= 10 batches |
| | chunk 1 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | | b1 | b2 | .. | bN | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 2 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| +----------+ | | Write in parallel |
| | chunk 3 |-----|----> ... | to temp .parquet files |
| +----------+ | | | | | |
| ... | | v v v |
| | | tmp_0 tmp_1 ... tmp_N |
| | +------------------------+ | |
| +----------+ | send | | recv | Pull next <= 10 |
| | chunk 11 |-----|------>| +-+-+-+-+-+-+-+-+-+-+ |------>| +----+----+----+----+ |
| +----------+ | | | | | | | | | | | | | | | |b11 |b12 | .. | bM | |
| +----------+ | send | +-+-+-+-+-+-+-+-+-+-+ | | +----+----+----+----+ |
| | chunk 12 |-----|------>| | | | | | |
| +----------+ | +------------------------+ | v v v |
| ... | | tmp_N+1 ... tmp_M |
+------------------+ | |
| ... repeat until done |
+-------------------------+
|
+----------------------------------------+
|
v
+-------------------+ +--------------------+
| Merge all temp | | |
| .parquet files |------>| final output.pqt |
| in order | | |
+-------------------+ +--------------------+
Memory at any moment: <= 10 chunks in channel + 10 being written
Backpressure: preserved -- reader blocks while a batch group is being written
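The bounded-batch scheme above can be sketched with the Python stdlib (plain text files stand in for the temporary Parquet files): pull a group of at most 10 batches, write the group in parallel, repeat until the input is exhausted, then merge the temp files in order.

```python
import itertools
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

GROUP_SIZE = 10  # pull at most 10 batches from the channel per round

def write_temp(batch: str, tmpdir: str, idx: int) -> Path:
    # Stand-in for writing one record batch to a temp .parquet file.
    path = Path(tmpdir) / f"tmp_{idx}.txt"
    path.write_text(batch + "\n")
    return path

def parallel_write(batches, out_path: Path) -> None:
    it = iter(enumerate(batches))
    temp_files: list[Path] = []
    with tempfile.TemporaryDirectory() as tmpdir, ThreadPoolExecutor() as pool:
        while group := list(itertools.islice(it, GROUP_SIZE)):
            # Write this group of <= GROUP_SIZE batches in parallel,
            # then pull the next group (backpressure is preserved).
            futures = [pool.submit(write_temp, b, tmpdir, i) for i, b in group]
            temp_files.extend(f.result() for f in futures)
        # Merge all temp files, in order, into the final output.
        out_path.write_text("".join(p.read_text() for p in temp_files))

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "out.txt"
    parallel_write([f"batch {i}" for i in range(25)], out)
    merged = out.read_text().splitlines()
print(merged[:2], merged[-1])
```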
SQL Queries (--sql)
⚠️ --sql (feature-gated): SQL queries require the full dataset to be materialized in memory via DataFusion’s MemTable before query execution. For large files this may result in significant memory usage. Queries that filter rows (e.g. SELECT ... WHERE ...) will reduce the output size but the input must still be fully loaded.
SQL Query Mode (--sql "SELECT ...")
===================================
Reader Thread Bounded Channel Main Thread
+------------------+ +---------------+ +---------------------------+
| | | | | |
| +----------+ | send | | recv | Collect ALL batches |
| | chunk 1 |-----|------>| |------>| into memory (required |
| +----------+ | | | | by DataFusion MemTable) |
| +----------+ | send | | | |
| | chunk 2 |-----|------>| |------>| +-----+-----+-----+ |
| +----------+ | | | | | b1 | b2 | ... | |
| ... | | | | +-----+-----+-----+ |
| +----------+ | send | | | | |
| | chunk N |-----|------>| |------>| v |
| +----------+ | | | | +-------------+ |
+------------------+ +---------------+ | | DataFusion | |
| | SQL Engine | |
| +-------------+ |
| | |
| v |
| Write filtered results |
| to output file |
+---------------------------+
Memory at peak: ALL chunks in memory (no backpressure)
This is inherent to SQL execution over in-memory tables.
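The materialize-then-query pattern can be illustrated with sqlite3's in-memory database as a stand-in for DataFusion's MemTable (an analogy only, not the actual engine): every chunk must be loaded before the WHERE clause runs, so the filter shrinks the output but not the peak input.

```python
import sqlite3

# Three parsed "chunks" of (id, name) rows, five rows each.
chunks = [[(i, f"row{i}") for i in range(start, start + 5)]
          for start in (0, 5, 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, name TEXT)")
for chunk in chunks:  # materialize ALL batches before querying
    conn.executemany("INSERT INTO data VALUES (?, ?)", chunk)

# The WHERE clause reduces the *output*, but the whole input
# was still loaded into memory first.
rows = conn.execute(
    "SELECT id, name FROM data WHERE id >= 12 ORDER BY id").fetchall()
conn.close()
print(rows)  # [(12, 'row12'), (13, 'row13'), (14, 'row14')]
```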
Reading Metadata from Output Files
When converting to Parquet or Feather, readstat-rs preserves column metadata (labels, SAS format strings, and storage widths) as Arrow field metadata. Schema-level metadata includes the table label when present.
The following metadata keys may appear on each field:
| Key | Description | Condition |
|---|---|---|
| label | User-assigned variable label | Non-empty |
| sas_format | SAS format string (e.g. DATE9, BEST12, $30) | Non-empty |
| storage_width | Number of bytes used to store the variable | Always |
| display_width | Display width hint from the file | Non-zero |
Schema-level metadata:
| Key | Description | Condition |
|---|---|---|
| table_label | User-assigned file label | Non-empty |
Reading metadata with Python (pyarrow)
import pyarrow.parquet as pq
schema = pq.read_schema("example.parquet")
# Table-level metadata
print((schema.metadata or {}).get(b"table_label", b"").decode())
# Per-column metadata
for field in schema:
meta = field.metadata or {}
print(f"{field.name}:")
print(f" label: {meta.get(b'label', b'').decode()}")
print(f" sas_format: {meta.get(b'sas_format', b'').decode()}")
print(f" storage_width: {meta.get(b'storage_width', b'').decode()}")
print(f" display_width: {meta.get(b'display_width', b'').decode()}")
Reading metadata with R (arrow)
library(arrow)
schema <- read_parquet("example.parquet", as_data_frame = FALSE)$schema
# Per-column metadata
for (name in names(schema)) {
  field <- schema[[name]]
cat(field$name, "\n")
cat(" label: ", field$metadata$label, "\n")
cat(" sas_format: ", field$metadata$sas_format, "\n")
cat(" storage_width:", field$metadata$storage_width, "\n")
cat(" display_width:", field$metadata$display_width, "\n")
}
Reader
The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.
mem → Parse and read the entire sas7bdat into memory before writing to either standard out or a file
stream (default) → Parse and read at most stream-rows into memory before writing to disk; stream-rows may be set via the command line parameter --stream-rows or, if elided, will default to 10,000 rows
Why is this useful?
- mem is useful for testing purposes
- stream is useful for keeping memory usage low for large datasets (and hence is the default)
- In general, users should not need to deviate from the default, stream, unless they have a specific need
- In addition, because these options are exposed as command line parameters, hyperfine may be used to benchmark across an assortment of file sizes
Debug
Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.
⚠️ This is quite verbose! If using the preview or data subcommand, debug information is written for every single value!
# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...
Help
For full details run with --help.
readstat --help
readstat metadata --help
readstat preview --help
readstat data --help