Usage

After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:

  • metadata → writes the following to standard out (or as JSON via --as-json)
    • row count
    • variable count
    • table name
    • table label
    • file encoding
    • format version
    • bitness
    • creation time
    • modified time
    • compression
    • byte order
    • variable names
    • variable type classes
    • variable types
    • variable labels
    • variable format classes
    • variable formats
    • arrow data types
  • preview → writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in csv format to standard out
  • data → writes parsed data in csv, feather, ndjson, or parquet format to a file

Metadata

To write metadata to standard out, invoke the following.

readstat metadata /some/dir/to/example.sas7bdat

To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.

readstat metadata /some/dir/to/example.sas7bdat --as-json

The JSON output contains file-level metadata and a vars object keyed by variable index. This makes it straightforward to search for a particular column by piping the output to jq or Python.

Search for a column with jq

# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | jq '.vars | to_entries[] | select(.value.var_name == "Make") | .value'

Search for a column with Python

# Find the variable entry whose var_name matches "Make"
readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | python -c "
import json, sys
md = json.load(sys.stdin)
match = [v for v in md['vars'].values() if v['var_name'] == 'Make']
if match:
    print(json.dumps(match[0], indent=2))
"

Preview Data

To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).

readstat preview /some/dir/to/example.sas7bdat

To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.

readstat preview /some/dir/to/example.sas7bdat --rows 100

Data

📝 The data subcommand includes a --format parameter, which specifies the file format to be written. Currently, the following formats have been implemented:

  • csv
  • feather
  • ndjson
  • parquet

csv

To write parsed data (as csv) to a file, invoke the following (the default is to write all parsed data to the specified file).

The default --format is csv, so the parameter is omitted from the examples below.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv

To write the first 100 rows of parsed data (as csv) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100

feather

To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather

To write the first 100 rows of parsed data (as feather) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100

ndjson

To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson

To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100

parquet

To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet

To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100

To write parsed data (as parquet) to a file with specific compression settings, invoke the following:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3

Column Selection

Select specific columns to include when converting or previewing data.

Step 1: View available columns

readstat metadata /some/dir/to/example.sas7bdat

Or as JSON for programmatic use with jq:

readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | jq '.vars | to_entries[] | .value.var_name'

Or with Python:

readstat metadata /some/dir/to/example.sas7bdat --as-json \
  | python -c "
import json, sys
md = json.load(sys.stdin)
for v in md['vars'].values():
    print(v['var_name'])
"

Step 2: Select columns on the command line

readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns Brand,Model,EngineSize

Step 2 (alt): Select columns from a file

Create columns.txt:

# Columns to extract from the dataset
Brand
Model
EngineSize

Then pass it to the CLI:

readstat data /some/dir/to/example.sas7bdat --output out.parquet --format parquet --columns-file columns.txt

Preview with column selection

readstat preview /some/dir/to/example.sas7bdat --columns Brand,Model,EngineSize

Parallelism

The data subcommand includes parameters for both parallel reading and parallel writing:

Parallel Reading (--parallel)

If set, the sas7bdat is read in parallel. If the total number of rows to process is greater than stream-rows (which defaults to 10,000 when unset), then the chunks of rows are read in parallel. Note that the --parallel option uses all processors on the user's machine; allowing the user to throttle this number may be considered in the future.

❗ Utilizing the --parallel parameter will increase memory usage — all chunks are read in parallel and collected in memory before being sent to the writer. In addition, because all processors are utilized, CPU usage may be maxed out during reading. Row ordering from the original sas7bdat is preserved.
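The order-preserving behavior can be sketched in Python (an illustration of the pattern only, not the readstat-rs internals): chunks are handed to a thread pool, and results are collected in submission order, so rows come back in their original order even though chunks are read concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunk(chunk):
    # Stand-in for parsing one chunk of stream-rows rows.
    return [row * 2 for row in chunk]

def read_parallel(rows, stream_rows=3):
    # Split the rows into chunks of stream_rows rows each.
    chunks = [rows[i:i + stream_rows] for i in range(0, len(rows), stream_rows)]
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in submission order, so the original
        # row ordering is preserved despite concurrent reads.
        results = pool.map(read_chunk, chunks)
        return [row for chunk in results for row in chunk]

print(read_parallel(list(range(10))))  # rows stay in original order
```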

Parallel Writing (--parallel-write)

When combined with --parallel, the --parallel-write flag enables parallel writing for Parquet format files. This can significantly improve write performance for large datasets by:

  • Writing record batches to temporary files in parallel using all available processors
  • Merging the temporary files into the final output
  • Using spooled temporary files that keep data in memory until a threshold is reached

Note: Parallel writing currently only supports the Parquet format. Other formats (CSV, Feather, NDJSON) will use optimized sequential writes with BufWriter.

Example usage:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write

Memory Buffer Size (--parallel-write-buffer-mb)

Controls the memory buffer size (in MB) before spilling to disk during parallel writes. Defaults to 100 MB. Valid range: 1-10240 MB.

Smaller buffers will cause data to spill to disk sooner, while larger buffers keep more data in memory. Choose based on your available memory and dataset size:

  • Small datasets (< 100 MB): Use default or larger buffer to keep everything in memory
  • Large datasets (> 1 GB): Consider smaller buffer (10-50 MB) to manage memory usage
  • Memory-constrained systems: Use smaller buffer (1-10 MB)
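The spill behavior this buffer controls can be illustrated with Python's tempfile.SpooledTemporaryFile, which follows the same idea (a conceptual stand-in, not the Rust implementation): writes stay in memory until max_size bytes, then roll over to a real temp file on disk. The _rolled attribute inspected below is a CPython implementation detail, used here only to show when the spill happens.

```python
import tempfile

# Analogous to a 1 MB --parallel-write-buffer-mb setting.
spool = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)

spool.write(b"x" * 512 * 1024)   # 0.5 MB written: still in memory
print(spool._rolled)             # False

spool.write(b"x" * 1024 * 1024)  # total now exceeds the 1 MB threshold
print(spool._rolled)             # True: data spilled to a real temp file
spool.close()
```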

Example with custom buffer size:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --parallel --parallel-write --parallel-write-buffer-mb 200

❗ Parallel writing may complete batches out of order; because the temporary files are merged in batch order, the final Parquet file still preserves the original row order.

Memory Considerations

Default: Sequential Writes

In the default sequential write mode, a bounded channel (capacity 10) connects the reader thread to the writer. This means at most 10 chunks (each containing up to stream-rows rows) are held in memory at any time, providing natural backpressure when the writer is slower than the reader. For most workloads this keeps memory usage reasonable, but for very wide datasets (hundreds of columns, string-heavy) each chunk can be large — consider lowering --stream-rows if memory is a concern.

Sequential Write (default)
==========================

 Reader Thread                 Bounded Channel (cap 10)          Main Thread
+---------------------+       +------------------------+       +---------------------+
|                     |       |                        |       |                     |
| +-----------+       | send  | +--+--+--+--+--+--+    | recv  | +-------+           |
| | chunk  1  |-------|------>| |  |  |  |  |  |  |    |------>| | write |---> file  |
| +-----------+       |       | +--+--+--+--+--+--+    |       | +-------+           |
| +-----------+       | send  |    channel is full!    |       |                     |
| | chunk  2  |-------|------>| +--+--+--+--+--+--+--+ |       | +-------+           |
| +-----------+       |       | |  |  |  |  |  |  |  | |       | | write |---> file  |
| +-----------+       |       | +--+--+--+--+--+--+--+ |       | +-------+           |
| | chunk  3  |-------|-XXXXX |                        |       |                     |
| +-----------+       | BLOCK | writer drains a slot   |       | +-------+           |
|   ... waits ...     |       |  +--+--+--+--+--+--+   |       | | write |---> file  |
| | chunk  3  |-------|------>|  |  |  |  |  |  |  |   |       | +-------+           |
| +-----------+       | ok!   |  +--+--+--+--+--+--+   |       |                     |
|                     |       |                        |       |                     |
+---------------------+       +------------------------+       +---------------------+

 Memory at any moment: <= 10 chunks in the channel + 1 being written
 Backpressure: reader blocks when channel is full
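The bounded-channel backpressure above can be sketched with Python's queue.Queue (an illustration of the pattern, not the actual Rust channel): the reader blocks on put() whenever 10 chunks are already waiting, and resumes as the writer drains slots.

```python
import queue
import threading

channel = queue.Queue(maxsize=10)  # bounded channel, capacity 10

def reader():
    for chunk_id in range(25):
        # Blocks here whenever 10 chunks are already queued (backpressure).
        channel.put(chunk_id)
    channel.put(None)  # sentinel: no more chunks

written = []

def writer():
    while True:
        chunk = channel.get()
        if chunk is None:
            break
        written.append(chunk)  # stand-in for writing the chunk to disk

t = threading.Thread(target=reader)
t.start()
writer()
t.join()
print(written == list(range(25)))  # True: row order preserved
```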

Parallel Writes (--parallel-write)

📝 --parallel-write: Uses bounded-batch processing — batches are pulled from the channel in groups (up to 10 at a time), written in parallel to temporary Parquet files, then the next group is pulled. This preserves the channel’s backpressure so that memory usage stays bounded rather than loading the entire dataset at once. All temporary files are merged into the final output at the end.

Parallel Write (--parallel --parallel-write)
============================================

 Reader Thread              Bounded Channel (cap 10)              Main Thread
+------------------+       +------------------------+       +-------------------------+
|                  |       |                        |       |                         |
| +----------+     | send  |                        | recv  |  Pull <= 10 batches     |
| | chunk  1 |-----|------>|  +-+-+-+-+-+-+-+-+-+-+ |------>|  +----+----+----+----+  |
| +----------+     |       |  | | | | | | | | | | | |       |  | b1 | b2 | .. | bN |  |
| +----------+     | send  |  +-+-+-+-+-+-+-+-+-+-+ |       |  +----+----+----+----+  |
| | chunk  2 |-----|------>|                        |       |    |    |         |     |
| +----------+     |       +------------------------+       |    v    v         v     |
| +----------+     |                                        |  Write in parallel      |
| | chunk  3 |-----|----> ...                               |  to temp .parquet files |
| +----------+     |                                        |    |    |         |     |
|     ...          |                                        |    v    v         v     |
|                  |                                        |  tmp_0 tmp_1 ... tmp_N  |
|                  |       +------------------------+       |                         |
| +----------+     | send  |                        | recv  |  Pull next <= 10        |
| | chunk 11 |-----|------>|  +-+-+-+-+-+-+-+-+-+-+ |------>|  +----+----+----+----+  |
| +----------+     |       |  | | | | | | | | | | | |       |  |b11 |b12 | .. | bM |  |
| +----------+     | send  |  +-+-+-+-+-+-+-+-+-+-+ |       |  +----+----+----+----+  |
| | chunk 12 |-----|------>|                        |       |    |    |         |     |
| +----------+     |       +------------------------+       |    v    v         v     |
|     ...          |                                        |  tmp_N+1  ...  tmp_M    |
+------------------+                                        |                         |
                                                            |  ... repeat until done  |
                                                            +-------------------------+
                                                                       |
                              +----------------------------------------+
                              |
                              v
                    +-------------------+       +--------------------+
                    |   Merge all temp  |       |                    |
                    |   .parquet files  |------>|  final output.pqt  |
                    |   in order        |       |                    |
                    +-------------------+       +--------------------+

 Memory at any moment: <= 10 chunks in channel + 10 being written
 Backpressure: preserved -- reader blocks while a batch group is being written
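The write-and-merge cycle can be sketched in Python (a simplified stand-in that writes plain text files instead of Parquet, not the actual implementation): batches are written to temporary files in bounded groups, then merged in batch order into the final output.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_temp(args):
    index, batch = args
    # Stand-in for writing one record batch to a temporary Parquet file.
    path = os.path.join(tempfile.gettempdir(), f"batch_{index}.txt")
    with open(path, "w") as f:
        f.write("\n".join(batch))
    return index, path

def parallel_write(batches, group_size=10):
    temp_files = []
    indexed = list(enumerate(batches))
    with ThreadPoolExecutor() as pool:
        # Pull batches in bounded groups so memory stays bounded.
        for start in range(0, len(indexed), group_size):
            group = indexed[start:start + group_size]
            temp_files.extend(pool.map(write_temp, group))
    # Merge the temporary files into the final output in batch order.
    temp_files.sort()
    merged = []
    for _, path in temp_files:
        with open(path) as f:
            merged.extend(f.read().splitlines())
        os.remove(path)
    return merged

batches = [[f"row{i}-{j}" for j in range(2)] for i in range(4)]
print(parallel_write(batches))  # rows merged back in batch order
```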

SQL Queries (--sql)

⚠️ --sql (feature-gated): SQL queries require the full dataset to be materialized in memory via DataFusion’s MemTable before query execution. For large files this may result in significant memory usage. Queries that filter rows (e.g. SELECT ... WHERE ...) will reduce the output size but the input must still be fully loaded.

SQL Query Mode (--sql "SELECT ...")
===================================

 Reader Thread              Bounded Channel              Main Thread
+------------------+       +---------------+       +---------------------------+
|                  |       |               |       |                           |
| +----------+     | send  |               | recv  |  Collect ALL batches      |
| | chunk  1 |-----|------>|               |------>|  into memory (required    |
| +----------+     |       |               |       |  by DataFusion MemTable)  |
| +----------+     | send  |               |       |                           |
| | chunk  2 |-----|------>|               |------>|  +-----+-----+-----+      |
| +----------+     |       |               |       |  |  b1 |  b2 | ... |      |
|     ...          |       |               |       |  +-----+-----+-----+      |
| +----------+     | send  |               |       |         |                 |
| | chunk  N |-----|------>|               |------>|         v                 |
| +----------+     |       |               |       |  +-------------+          |
+------------------+       +---------------+       |  |  DataFusion |          |
                                                   |  |  SQL Engine |          |
                                                   |  +-------------+          |
                                                   |         |                 |
                                                   |         v                 |
                                                   |  Write filtered results   |
                                                   |  to output file           |
                                                   +---------------------------+

 Memory at peak: ALL chunks in memory (no backpressure)
 This is inherent to SQL execution over in-memory tables.
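The memory profile can be illustrated with Python's in-memory sqlite3 (a stand-in for DataFusion, not the engine readstat-rs uses, and the rows below are made-up sample data): the entire table must be materialized in memory before the WHERE clause can filter anything.

```python
import sqlite3

# Hypothetical dataset; in readstat-rs this would be the parsed sas7bdat.
rows = [("Acura", 3.5), ("BMW", 2.5), ("Buick", 3.6)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cars (make TEXT, engine_size REAL)")
# The ENTIRE input is loaded into the in-memory table before querying.
con.executemany("INSERT INTO cars VALUES (?, ?)", rows)

# The WHERE clause shrinks the output, but not the input already loaded.
result = con.execute(
    "SELECT make FROM cars WHERE engine_size > 3.0"
).fetchall()
print(result)  # [('Acura',), ('Buick',)]
con.close()
```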

Reading Metadata from Output Files

When converting to Parquet or Feather, readstat-rs preserves column metadata (labels, SAS format strings, and storage widths) as Arrow field metadata. Schema-level metadata includes the table label when present.

The following metadata keys may appear on each field:

Key             Description                                  Condition
label           User-assigned variable label                 Non-empty
sas_format      SAS format string (e.g. DATE9, BEST12, $30)  Non-empty
storage_width   Number of bytes used to store the variable   Always
display_width   Display width hint from the file             Non-zero

Schema-level metadata:

Key           Description                Condition
table_label   User-assigned file label   Non-empty

Reading metadata with Python (pyarrow)

import pyarrow.parquet as pq

schema = pq.read_schema("example.parquet")

# Table-level metadata (schema.metadata is None when absent)
schema_meta = schema.metadata or {}
print(schema_meta.get(b"table_label", b"").decode())

# Per-column metadata
for field in schema:
    meta = field.metadata or {}
    print(f"{field.name}:")
    print(f"  label:         {meta.get(b'label', b'').decode()}")
    print(f"  sas_format:    {meta.get(b'sas_format', b'').decode()}")
    print(f"  storage_width: {meta.get(b'storage_width', b'').decode()}")
    print(f"  display_width: {meta.get(b'display_width', b'').decode()}")

Reading metadata with R (arrow)

library(arrow)

schema <- read_parquet("example.parquet", as_data_frame = FALSE)$schema

# Table-level metadata
cat(schema$metadata$table_label, "\n")

# Per-column metadata (a Schema is not directly iterable; index by name)
for (name in names(schema)) {
  field <- schema[[name]]
  cat(field$name, "\n")
  cat("  label:        ", field$metadata$label, "\n")
  cat("  sas_format:   ", field$metadata$sas_format, "\n")
  cat("  storage_width:", field$metadata$storage_width, "\n")
  cat("  display_width:", field$metadata$display_width, "\n")
}

Reader

The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.

  • mem → Parse and read the entire sas7bdat into memory before writing to either standard out or a file
  • stream (default) → Parse and read at most stream-rows rows into memory before writing to disk
    • stream-rows may be set via the command line parameter --stream-rows or if elided will default to 10,000 rows
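The difference between the two readers can be sketched in Python (a conceptual illustration, not the Rust implementation): mem materializes every row before writing, while stream holds at most stream_rows rows in memory at a time.

```python
def read_mem(rows):
    # mem: materialize every row before any writing happens.
    return list(rows)

def read_stream(rows, stream_rows=10_000):
    # stream: yield bounded chunks; at most stream_rows rows in memory.
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == stream_rows:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk

chunks = list(read_stream(range(25_000)))
print([len(c) for c in chunks])  # [10000, 10000, 5000]
```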

Why is this useful?

  • mem is useful for testing purposes
  • stream is useful for keeping memory usage low for large datasets (and hence is the default)
  • In general, users should not need to deviate from the default — stream — unless they have a specific need
  • In addition, exposing these options as command line parameters allows hyperfine to be used to benchmark across an assortment of file sizes

Debug

Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.

⚠️ This is quite verbose! When using the preview or data subcommand, debug information is written for every single value!

# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...

Help

For full details run with --help.

readstat --help
readstat metadata --help
readstat preview --help
readstat data --help