Usage

After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:

  • metadata → writes the following to standard out (or as JSON via --as-json)
    • row count
    • variable count
    • table name
    • table label
    • file encoding
    • format version
    • bitness
    • creation time
    • modified time
    • compression
    • byte order
    • variable names
    • variable type classes
    • variable types
    • variable labels
    • variable format classes
    • variable formats
    • arrow data types
  • preview → writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in csv format to standard out
  • data → writes parsed data in csv, feather, ndjson, or parquet format to a file

Metadata

To write metadata to standard out, invoke the following.

readstat metadata /some/dir/to/example.sas7bdat

To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.

readstat metadata /some/dir/to/example.sas7bdat --as-json

The JSON output contains file-level metadata and a vars object keyed by variable index. This makes it straightforward to search for a particular column by piping the output to jq or Python.

Search for a column with jq

# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | jq '.vars | to_entries[] | select(.value.var_name == "Make") | .value'

Search for a column with Python

# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | python -c "
import json, sys
md = json.load(sys.stdin)
match = [v for v in md['vars'].values() if v['var_name'] == 'Make']
if match:
    print(json.dumps(match[0], indent=2))
"

Preview Data

To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).

readstat preview /some/dir/to/example.sas7bdat

To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.

readstat preview /some/dir/to/example.sas7bdat --rows 100

Data

📝 The data subcommand includes a --format parameter, which specifies the file format to be written. Currently, the following formats have been implemented:

  • csv
  • feather
  • ndjson
  • parquet

csv

To write parsed data (as csv) to a file, invoke the following (the default is to write all parsed data to the specified file).

The default --format is csv, so the parameter is omitted from the examples below.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv

To write the first 100 rows of parsed data (as csv) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100

feather

To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather

To write the first 100 rows of parsed data (as feather) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100

ndjson

To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson

To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100

parquet

To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet

To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100

To write parsed data (as parquet) to a file with specific compression settings, invoke the following:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3

Column Selection

Select specific columns to include when converting or previewing data.

Step 1: View available columns

readstat metadata /some/dir/to/example.sas7bdat

Or as JSON for programmatic use with jq:

readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | jq '.vars | to_entries[] | .value.var_name'

Or with Python:

readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | python -c "
import json, sys
md = json.load(sys.stdin)
for v in md['vars'].values():
    print(v['var_name'])
"

Step 2: Select columns on the command line

readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns Brand,Model,EngineSize

Step 2 (alt): Select columns from a file

Create columns.txt:

# Columns to extract from the dataset
Brand
Model
EngineSize

Then pass it to the CLI:

readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns-file columns.txt

Preview with column selection

readstat preview /some/dir/to/example.sas7bdat --columns Brand,Model,EngineSize

Parallelism

The data subcommand includes parameters for both parallel reading and parallel writing:

Parallel Reading (--parallel)

If set, the sas7bdat is read in parallel. If the total number of rows to process is greater than stream-rows (which defaults to 10,000 when unset), then the chunks of rows are read in parallel. Note that the --parallel option uses all processors on the user's machine; allowing the user to throttle this number may be considered in the future.

❗ Utilizing the --parallel parameter will increase memory usage — all chunks are read in parallel and collected in memory before being sent to the writer. In addition, because all processors are utilized, CPU usage may be maxed out during reading. Row ordering from the original sas7bdat is preserved.
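The order-preserving behavior can be sketched in Python (an illustration of the pattern only, not the readstat-rs internals): chunks are handed to a thread pool, and results are collected in submission order, so rows come back in their original order even though chunks are read concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunk(chunk):
    # Stand-in for parsing one chunk of stream-rows rows.
    return [row * 2 for row in chunk]

def read_parallel(rows, stream_rows=3):
    # Split the rows into chunks of stream_rows rows each.
    chunks = [rows[i:i + stream_rows] for i in range(0, len(rows), stream_rows)]
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in submission order, so the original
        # row ordering is preserved despite concurrent reads.
        results = pool.map(read_chunk, chunks)
        return [row for chunk in results for row in chunk]

print(read_parallel(list(range(10))))  # rows stay in original order
```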

Parallel Writing (--parallel-write)

When combined with --parallel, the --parallel-write flag enables parallel writing for Parquet format files. This can significantly improve write performance for large datasets by:

  • Writing record batches to temporary files in parallel using all available processors
  • Merging the temporary files into the final output
  • Using spooled temporary files that keep data in memory until a threshold is reached

Note: Parallel writing currently only supports the Parquet format. Other formats (CSV, Feather, NDJSON) will use optimized sequential writes with BufWriter.

Example usage:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write

Memory Buffer Size (--parallel-write-buffer-mb)

Controls the memory buffer size (in MB) before spilling to disk during parallel writes. Defaults to 100 MB. Valid range: 1-10240 MB.

Smaller buffers will cause data to spill to disk sooner, while larger buffers keep more data in memory. Choose based on your available memory and dataset size:

  • Small datasets (< 100 MB): Use default or larger buffer to keep everything in memory
  • Large datasets (> 1 GB): Consider smaller buffer (10-50 MB) to manage memory usage
  • Memory-constrained systems: Use smaller buffer (1-10 MB)
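The spill behavior this buffer controls can be illustrated with Python's tempfile.SpooledTemporaryFile, which follows the same idea (a conceptual stand-in, not the Rust implementation): writes stay in memory until max_size bytes, then roll over to a real temp file on disk. The _rolled attribute inspected below is a CPython implementation detail, used here only to show when the spill happens.

```python
import tempfile

# Analogous to a 1 MB --parallel-write-buffer-mb setting.
spool = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)

spool.write(b"x" * 512 * 1024)   # 0.5 MB written: still in memory
print(spool._rolled)             # False

spool.write(b"x" * 1024 * 1024)  # total now exceeds the 1 MB threshold
print(spool._rolled)             # True: data spilled to a real temp file
spool.close()
```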

Example with custom buffer size:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write --parallel-write-buffer-mb 200

❗ Parallel writing may complete batches out of order; because the temporary files are merged in batch order, the final Parquet file still preserves the original row order.

Memory Considerations

Default: Sequential Writes

In the default sequential write mode, a bounded channel (capacity 10) connects the reader thread to the writer. This means at most 10 chunks (each containing up to stream-rows rows) are held in memory at any time, providing natural backpressure when the writer is slower than the reader. For most workloads this keeps memory usage reasonable, but for very wide datasets (hundreds of columns, string-heavy) each chunk can be large — consider lowering --stream-rows if memory is a concern.

Sequential Write (default)
==========================

 Reader Thread                 Bounded Channel (cap 10)          Main Thread
+---------------------+       +------------------------+       +---------------------+
|                     |       |                        |       |                     |
| +-----------+       | send  | +--+--+--+--+--+--+    | recv  | +-------+           |
| | chunk  1  |-------|------>| |  |  |  |  |  |  |    |------>| | write |---> file  |
| +-----------+       |       | +--+--+--+--+--+--+    |       | +-------+           |
| +-----------+       | send  |    channel is full!    |       |                     |
| | chunk  2  |-------|------>| +--+--+--+--+--+--+--+ |       | +-------+           |
| +-----------+       |       | |  |  |  |  |  |  |  | |       | | write |---> file  |
| +-----------+       |       | +--+--+--+--+--+--+--+ |       | +-------+           |
| | chunk  3  |-------|-XXXXX |                        |       |                     |
| +-----------+       | BLOCK | writer drains a slot   |       | +-------+           |
|   ... waits ...     |       |  +--+--+--+--+--+--+   |       | | write |---> file  |
| | chunk  3  |-------|------>|  |  |  |  |  |  |  |   |       | +-------+           |
| +-----------+       | ok!   |  +--+--+--+--+--+--+   |       |                     |
|                     |       |                        |       |                     |
+---------------------+       +------------------------+       +---------------------+

 Memory at any moment: <= 10 chunks in the channel + 1 being written
 Backpressure: reader blocks when channel is full
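The bounded-channel backpressure above can be sketched with Python's queue.Queue (an illustration of the pattern, not the actual Rust channel): the reader blocks on put() whenever 10 chunks are already waiting, and resumes as the writer drains slots.

```python
import queue
import threading

channel = queue.Queue(maxsize=10)  # bounded channel, capacity 10

def reader():
    for chunk_id in range(25):
        # Blocks here whenever 10 chunks are already queued (backpressure).
        channel.put(chunk_id)
    channel.put(None)  # sentinel: no more chunks

written = []

def writer():
    while True:
        chunk = channel.get()
        if chunk is None:
            break
        written.append(chunk)  # stand-in for writing the chunk to disk

t = threading.Thread(target=reader)
t.start()
writer()
t.join()
print(written == list(range(25)))  # True: row order preserved
```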

Parallel Writes (--parallel-write)

📝 --parallel-write: Uses bounded-batch processing — batches are pulled from the channel in groups (up to 10 at a time), written in parallel to temporary Parquet files, then the next group is pulled. This preserves the channel’s backpressure so that memory usage stays bounded rather than loading the entire dataset at once. All temporary files are merged into the final output at the end.

Parallel Write (--parallel --parallel-write)
============================================

 Reader Thread              Bounded Channel (cap 10)              Main Thread
+------------------+       +------------------------+       +-------------------------+
|                  |       |                        |       |                         |
| +----------+     | send  |                        | recv  |  Pull <= 10 batches     |
| | chunk  1 |-----|------>|  +-+-+-+-+-+-+-+-+-+-+ |------>|  +----+----+----+----+  |
| +----------+     |       |  | | | | | | | | | | | |       |  | b1 | b2 | .. | bN |  |
| +----------+     | send  |  +-+-+-+-+-+-+-+-+-+-+ |       |  +----+----+----+----+  |
| | chunk  2 |-----|------>|                        |       |    |    |         |     |
| +----------+     |       +------------------------+       |    v    v         v     |
| +----------+     |                                        |  Write in parallel      |
| | chunk  3 |-----|----> ...                               |  to temp .parquet files |
| +----------+     |                                        |    |    |         |     |
|     ...          |                                        |    v    v         v     |
|                  |                                        |  tmp_0 tmp_1 ... tmp_N  |
|                  |       +------------------------+       |                         |
| +----------+     | send  |                        | recv  |  Pull next <= 10        |
| | chunk 11 |-----|------>|  +-+-+-+-+-+-+-+-+-+-+ |------>|  +----+----+----+----+  |
| +----------+     |       |  | | | | | | | | | | | |       |  |b11 |b12 | .. | bM |  |
| +----------+     | send  |  +-+-+-+-+-+-+-+-+-+-+ |       |  +----+----+----+----+  |
| | chunk 12 |-----|------>|                        |       |    |    |         |     |
| +----------+     |       +------------------------+       |    v    v         v     |
|     ...          |                                        |  tmp_N+1  ...  tmp_M    |
+------------------+                                        |                         |
                                                            |  ... repeat until done  |
                                                            +-------------------------+
                                                                       |
                              +----------------------------------------+
                              |
                              v
                    +-------------------+       +--------------------+
                    |   Merge all temp  |       |                    |
                    |   .parquet files  |------>|  final output.pqt  |
                    |   in order        |       |                    |
                    +-------------------+       +--------------------+

 Memory at any moment: <= 10 chunks in channel + 10 being written
 Backpressure: preserved -- reader blocks while a batch group is being written
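The write-and-merge cycle can be sketched in Python (a simplified stand-in that writes plain text files instead of Parquet, not the actual implementation): batches are written to temporary files in bounded groups, then merged in batch order into the final output.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_temp(args):
    index, batch = args
    # Stand-in for writing one record batch to a temporary Parquet file.
    path = os.path.join(tempfile.gettempdir(), f"batch_{index}.txt")
    with open(path, "w") as f:
        f.write("\n".join(batch))
    return index, path

def parallel_write(batches, group_size=10):
    temp_files = []
    indexed = list(enumerate(batches))
    with ThreadPoolExecutor() as pool:
        # Pull batches in bounded groups so memory stays bounded.
        for start in range(0, len(indexed), group_size):
            group = indexed[start:start + group_size]
            temp_files.extend(pool.map(write_temp, group))
    # Merge the temporary files into the final output in batch order.
    temp_files.sort()
    merged = []
    for _, path in temp_files:
        with open(path) as f:
            merged.extend(f.read().splitlines())
        os.remove(path)
    return merged

batches = [[f"row{i}-{j}" for j in range(2)] for i in range(4)]
print(parallel_write(batches))  # rows merged back in batch order
```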

SQL Queries (--sql)

⚠️ --sql (feature-gated): SQL queries require the full dataset to be materialized in memory via DataFusion’s MemTable before query execution. For large files this may result in significant memory usage. Queries that filter rows (e.g. SELECT ... WHERE ...) will reduce the output size but the input must still be fully loaded.

SQL Query Mode (--sql "SELECT ...")
===================================

 Reader Thread              Bounded Channel              Main Thread
+------------------+       +---------------+       +---------------------------+
|                  |       |               |       |                           |
| +----------+     | send  |               | recv  |  Collect ALL batches      |
| | chunk  1 |-----|------>|               |------>|  into memory (required    |
| +----------+     |       |               |       |  by DataFusion MemTable)  |
| +----------+     | send  |               |       |                           |
| | chunk  2 |-----|------>|               |------>|  +-----+-----+-----+      |
| +----------+     |       |               |       |  |  b1 |  b2 | ... |      |
|     ...          |       |               |       |  +-----+-----+-----+      |
| +----------+     | send  |               |       |         |                 |
| | chunk  N |-----|------>|               |------>|         v                 |
| +----------+     |       |               |       |  +-------------+          |
+------------------+       +---------------+       |  |  DataFusion |          |
                                                   |  |  SQL Engine |          |
                                                   |  +-------------+          |
                                                   |         |                 |
                                                   |         v                 |
                                                   |  Write filtered results   |
                                                   |  to output file           |
                                                   +---------------------------+

 Memory at peak: ALL chunks in memory (no backpressure)
 This is inherent to SQL execution over in-memory tables.
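The memory profile can be illustrated with Python's in-memory sqlite3 (a stand-in for DataFusion, not the engine readstat-rs uses, and the rows below are made-up sample data): the entire table must be materialized in memory before the WHERE clause can filter anything.

```python
import sqlite3

# Hypothetical dataset; in readstat-rs this would be the parsed sas7bdat.
rows = [("Acura", 3.5), ("BMW", 2.5), ("Buick", 3.6)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cars (make TEXT, engine_size REAL)")
# The ENTIRE input is loaded into the in-memory table before querying.
con.executemany("INSERT INTO cars VALUES (?, ?)", rows)

# The WHERE clause shrinks the output, but not the input already loaded.
result = con.execute(
    "SELECT make FROM cars WHERE engine_size > 3.0"
).fetchall()
print(result)  # [('Acura',), ('Buick',)]
con.close()
```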

Reading Metadata from Output Files

When converting to Parquet or Feather, readstat-rs preserves column metadata (labels, SAS format strings, and storage widths) as Arrow field metadata. Schema-level metadata includes the table label when present.

The following metadata keys may appear on each field:

Key             Description                                  Condition
label           User-assigned variable label                 Non-empty
sas_format      SAS format string (e.g. DATE9, BEST12, $30)  Non-empty
storage_width   Number of bytes used to store the variable   Always
display_width   Display width hint from the file             Non-zero

Schema-level metadata:

Key           Description                Condition
table_label   User-assigned file label   Non-empty

Reading metadata with Python (pyarrow)

import pyarrow.parquet as pq

schema = pq.read_schema("example.parquet")

# Table-level metadata (schema.metadata is None when absent)
schema_meta = schema.metadata or {}
print(schema_meta.get(b"table_label", b"").decode())

# Per-column metadata
for field in schema:
    meta = field.metadata or {}
    print(f"{field.name}:")
    print(f"  label:         {meta.get(b'label', b'').decode()}")
    print(f"  sas_format:    {meta.get(b'sas_format', b'').decode()}")
    print(f"  storage_width: {meta.get(b'storage_width', b'').decode()}")
    print(f"  display_width: {meta.get(b'display_width', b'').decode()}")

Reading metadata with R (arrow)

library(arrow)

schema <- read_parquet("example.parquet", as_data_frame = FALSE)$schema

# Table-level metadata
cat(schema$metadata$table_label, "\n")

# Per-column metadata (a Schema is not directly iterable; index by name)
for (name in names(schema)) {
  field <- schema[[name]]
  cat(field$name, "\n")
  cat("  label:        ", field$metadata$label, "\n")
  cat("  sas_format:   ", field$metadata$sas_format, "\n")
  cat("  storage_width:", field$metadata$storage_width, "\n")
  cat("  display_width:", field$metadata$display_width, "\n")
}

Reader

The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.

  • mem → Parse and read the entire sas7bdat into memory before writing to either standard out or a file
  • stream (default) → Parse and read at most stream-rows rows into memory before writing to disk
    • stream-rows may be set via the command line parameter --stream-rows or if elided will default to 10,000 rows
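The difference between the two readers can be sketched in Python (a conceptual illustration, not the Rust implementation): mem materializes every row before writing, while stream holds at most stream_rows rows in memory at a time.

```python
def read_mem(rows):
    # mem: materialize every row before any writing happens.
    return list(rows)

def read_stream(rows, stream_rows=10_000):
    # stream: yield bounded chunks; at most stream_rows rows in memory.
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == stream_rows:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk

chunks = list(read_stream(range(25_000)))
print([len(c) for c in chunks])  # [10000, 10000, 5000]
```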

Why is this useful?

  • mem is useful for testing purposes
  • stream is useful for keeping memory usage low for large datasets (and hence is the default)
  • In general, users should not need to deviate from the default — stream — unless they have a specific need
  • In addition, exposing these options as command line parameters allows hyperfine to be used to benchmark across an assortment of file sizes

Debug

Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.

⚠️ This is quite verbose! When using the preview or data subcommand, debug information is written for every single value!

# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...

Help

For full details run with --help.

readstat --help
readstat metadata --help
readstat preview --help
readstat data --help