Crate readstat

Read SAS binary files (.sas7bdat) and convert them to other formats.

This crate provides a library for parsing SAS binary data files using FFI bindings to the ReadStat C library, then converting the parsed data into Apache Arrow RecordBatch format for output as CSV, Feather (Arrow IPC), NDJSON, or Parquet.

Note: While the underlying readstat-sys crate exposes bindings for all formats supported by ReadStat (SAS, SPSS, Stata), this crate currently only implements parsing and conversion for SAS .sas7bdat files.

§Data Pipeline

.sas7bdat file
    → ReadStat C library (FFI parsing via callbacks)
        → Typed Arrow builders (StringBuilder, Float64Builder, etc.)
            → Arrow RecordBatch
                → Output format (CSV / Feather / NDJSON / Parquet)
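
The "typed Arrow builders" step can be illustrated without any Arrow dependency. The following is a toy sketch of the callback-to-typed-builder pattern only (the types and names here are hypothetical, not this crate's actual API): the ReadStat value callback fires once per cell and appends to a per-column builder of the matching type, with SAS missing values becoming nulls.

```rust
// Toy sketch of the callback → typed-builder step. Hypothetical types,
// not the crate's API: the real code uses arrow's StringBuilder,
// Float64Builder, etc.
enum Builder {
    Str(Vec<Option<String>>),
    F64(Vec<Option<f64>>),
}

enum Value {
    Str(String),
    F64(f64),
    Missing, // SAS missing value → null in the column
}

impl Builder {
    fn append(&mut self, v: Value) {
        match (self, v) {
            (Builder::Str(col), Value::Str(s)) => col.push(Some(s)),
            (Builder::F64(col), Value::F64(x)) => col.push(Some(x)),
            (Builder::Str(col), Value::Missing) => col.push(None),
            (Builder::F64(col), Value::Missing) => col.push(None),
            _ => panic!("value type does not match column type"),
        }
    }
}

fn main() {
    let mut name = Builder::Str(Vec::new());
    let mut age = Builder::F64(Vec::new());
    // The value callback fires once per cell, row by row:
    name.append(Value::Str("Ada".into()));
    age.append(Value::F64(36.0));
    name.append(Value::Str("Grace".into()));
    age.append(Value::Missing);
    // Finished columns are then assembled into a RecordBatch.
    if let Builder::F64(col) = &age {
        assert_eq!(col, &vec![Some(36.0), None]);
    }
}
```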

§Examples

§Inspect file metadata

Read metadata without loading any row data. Useful for discovering schema, row counts, variable types, and SAS format classifications.

use readstat::{ReadStatPath, ReadStatMetadata};

let rsp = ReadStatPath::new("data.sas7bdat".into())?;

let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;

println!("Rows: {}, Variables: {}", md.row_count, md.var_count);
println!("Encoding: {}", md.file_encoding);
println!("Compression: {:?}", md.compression);

// Iterate over variable metadata
for (idx, var) in &md.vars {
    println!(
        "  [{idx}] {} ({:?}, format: {})",
        var.var_name, var.var_type_class, var.var_format
    );
}

// The Arrow schema is also available
println!("Schema: {:?}", md.schema);

§Read all data into Arrow RecordBatch

Parse the entire file into a single Arrow RecordBatch. Best for smaller files that fit comfortably in memory.

use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};

let rsp = ReadStatPath::new("data.sas7bdat".into())?;

// Read metadata first
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;

// Read all rows into a single chunk
let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data(&rsp)?;

// Access the Arrow RecordBatch
if let Some(batch) = &d.batch {
    println!("Got {} rows x {} columns", batch.num_rows(), batch.num_columns());
    println!("Schema: {:?}", batch.schema());
}

§Stream data in chunks and write to Parquet

For large files, read in streaming chunks to control memory usage. Each chunk is parsed and written incrementally.

use readstat::{
    ReadStatPath, ReadStatMetadata, ReadStatData, ReadStatWriter,
    WriteConfig, OutFormat, build_offsets,
};

let rsp = ReadStatPath::new("data.sas7bdat".into())?;
let wc = WriteConfig::new(
    Some("output.parquet".into()),
    Some(OutFormat::Parquet),
    false, // overwrite
    None,  // compression (defaults to Snappy for Parquet)
    None,  // compression_level
)?;

let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;

// Build chunk offsets: [0, 10000, 20000, ..., row_count]
let offsets = build_offsets(md.row_count as u32, 10_000)?;
let mut wtr = ReadStatWriter::new();
let pairs = offsets.windows(2);
let pairs_cnt = pairs.len();

for (i, w) in pairs.enumerate() {
    let mut d = ReadStatData::new().init(md.clone(), w[0], w[1]);
    d.read_data(&rsp)?;
    wtr.write(&d, &wc)?;
    if i == pairs_cnt - 1 {
        wtr.finish(&d, &wc, &rsp.path)?;
    }
}
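
The boundary logic implied by the comment above (`[0, 10000, 20000, ..., row_count]`) can be sketched in isolation. This is an illustration only, not the crate's implementation: the real build_offsets returns a Result and may validate its arguments differently.

```rust
// Sketch of the chunk-boundary logic: a boundary at every multiple of
// `step`, ending exactly at `row_count`. Each adjacent pair of offsets
// is one chunk. Illustration only — the crate's build_offsets returns
// a Result and may handle edge cases differently.
fn sketch_offsets(row_count: u32, step: u32) -> Vec<u32> {
    let mut offsets: Vec<u32> = (0..row_count).step_by(step as usize).collect();
    offsets.push(row_count);
    offsets
}

fn main() {
    let offsets = sketch_offsets(25_000, 10_000);
    // Chunks cover rows [0, 10000), [10000, 20000), [20000, 25000)
    assert_eq!(offsets, vec![0, 10_000, 20_000, 25_000]);
    for w in offsets.windows(2) {
        println!("chunk rows {}..{}", w[0], w[1]);
    }
}
```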

§Read from in-memory bytes

Parse a .sas7bdat file from a byte slice instead of the filesystem. Useful for cloud storage, HTTP uploads, WASM targets, and testing.

use readstat::{ReadStatMetadata, ReadStatData};

// sas_bytes: &[u8] — obtained from S3, HTTP, etc.
let mut md = ReadStatMetadata::new();
md.read_metadata_from_bytes(sas_bytes, false)?;

let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data_from_bytes(sas_bytes)?;

if let Some(batch) = &d.batch {
    println!("Parsed {} rows from bytes", batch.num_rows());
}

§Filter to specific columns

Select only specific columns before reading data. Unselected columns are skipped during parsing, reducing both memory and CPU usage.

use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};
use std::sync::Arc;

let rsp = ReadStatPath::new("data.sas7bdat".into())?;

let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;

// Select only these columns
let columns = vec!["name".to_string(), "age".to_string()];
let filter = md.resolve_selected_columns(Some(columns))?;

if let Some(ref mapping) = filter {
    // Apply filter to metadata (updates schema and vars)
    let original_var_count = md.var_count;
    md = md.filter_to_selected_columns(mapping);

    let row_count = md.row_count as u32;
    let mut d = ReadStatData::new()
        .set_column_filter(Some(Arc::new(mapping.clone())), original_var_count)
        .init(md, 0, row_count);
    d.read_data(&rsp)?;

    if let Some(batch) = &d.batch {
        // batch only contains "name" and "age" columns
        println!(
            "Columns: {:?}",
            batch.schema().fields().iter().map(|f| f.name()).collect::<Vec<_>>()
        );
    }
}

§Convert RecordBatch to in-memory bytes

Serialize a parsed RecordBatch directly to in-memory bytes without writing to a file. Useful for HTTP responses, message queues, or piping to other Arrow-aware tools.

use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};
use readstat::write_batch_to_parquet_bytes;
use readstat::write_batch_to_csv_bytes;

let rsp = ReadStatPath::new("data.sas7bdat".into())?;

let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;

let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data(&rsp)?;

if let Some(batch) = &d.batch {
    // Get Parquet bytes (e.g. for an HTTP response)
    let parquet_bytes = write_batch_to_parquet_bytes(batch)?;

    // Or CSV bytes
    let csv_bytes = write_batch_to_csv_bytes(batch)?;
}

§Key Types

The primary entry points are ReadStatPath (validated input path), ReadStatMetadata (file and variable metadata plus the Arrow schema), ReadStatData (parsed rows as an Arrow RecordBatch), and WriteConfig with ReadStatWriter (output configuration and writing); see the Structs section below for the full list.

§Features

Output format writers are feature-gated (all except sql are enabled by default):

| Feature | Format  | Notes                                                       |
|---------|---------|-------------------------------------------------------------|
| csv     | CSV     | Comma-separated values via arrow-csv                        |
| parquet | Parquet | Columnar format via the parquet crate, 5 compression codecs |
| feather | Feather | Arrow IPC format via arrow-ipc                              |
| ndjson  | NDJSON  | Newline-delimited JSON via arrow-json                       |
| sql     | SQL     | Query data with SQL via DataFusion (not enabled by default) |
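
Unneeded writers can be disabled to trim the dependency tree. A hypothetical Cargo.toml entry, using the feature names from the table above (sketch only; pin a real version and verify the crate's actual defaults):

```toml
# Enable only the Parquet writer; disable the other default format features.
[dependencies]
readstat = { version = "*", default-features = false, features = ["parquet"] }
```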

Modules§

cb 🔒
FFI callback functions invoked by the ReadStat C library during parsing.
common 🔒
Shared utility functions used across the crate.
err 🔒
Error types for the readstat crate.
formats 🔒
SAS format string classification using regex-based detection.
progress 🔒
Progress reporting trait for parsing feedback.
rs_buffer_io 🔒
Buffer-based I/O handlers for parsing SAS files from in-memory byte slices.
rs_data 🔒
Data reading and Arrow RecordBatch conversion.
rs_metadata 🔒
File-level and variable-level metadata extracted from .sas7bdat files.
rs_parser 🔒
Safe wrapper around the ReadStat C parser.
rs_path 🔒
Path validation for SAS file input.
rs_var 🔒
Variable types and format classification for SAS data.
rs_write 🔒
Output writers for converting Arrow RecordBatch data to CSV, Feather (Arrow IPC), NDJSON, or Parquet format.
rs_write_config 🔒
Output configuration for writing Arrow data to various formats.

Structs§

ReadStatData
Holds parsed row data from a .sas7bdat file and converts it to Arrow format.
ReadStatMetadata
File-level metadata extracted from a .sas7bdat file.
ReadStatPath
Validated file path for SAS file input.
ReadStatVarMetadata
Metadata for a single variable (column) in a SAS dataset.
ReadStatWriter
Manages writing Arrow [RecordBatch] data to the configured output format.
WriteConfig
Output configuration for writing Arrow data.

Enums§

OutFormat
Output file format for data conversion.
ParquetCompression
Parquet compression algorithm.
ReadStatCError
Error codes returned by the ReadStat C library.
ReadStatCompress
Compression method used in a .sas7bdat file.
ReadStatEndian
Byte order (endianness) of a .sas7bdat file.
ReadStatError
The main error type for the readstat crate.
ReadStatVarFormatClass
Semantic classification of a SAS format string.
ReadStatVarType
The storage type of a SAS variable, as reported by the ReadStat C library.
ReadStatVarTypeClass
High-level type class of a SAS variable: string or numeric.

Traits§

ProgressCallback
Trait for receiving progress updates during data parsing.

Functions§

build_offsets
Computes row offset boundaries for streaming chunk-based processing.
write_batch_to_csv_bytes
Serialize a RecordBatch to CSV bytes (with header).
write_batch_to_feather_bytes
Serialize a RecordBatch to Feather (Arrow IPC) bytes.
write_batch_to_ndjson_bytes
Serialize a RecordBatch to NDJSON bytes.
write_batch_to_parquet_bytes
Serialize a RecordBatch to Parquet bytes with Snappy compression.