Read SAS binary files (.sas7bdat) and convert them to other formats.
This crate provides a library for parsing SAS binary data files using FFI
bindings to the ReadStat C library,
then converting the parsed data into Apache Arrow RecordBatch
format for output as CSV, Feather (Arrow IPC), NDJSON, or Parquet.
Note: While the underlying readstat-sys crate
exposes bindings for all formats supported by ReadStat (SAS, SPSS, Stata),
this crate currently only implements parsing and conversion for SAS .sas7bdat files.
§Data Pipeline
.sas7bdat file
→ ReadStat C library (FFI parsing via callbacks)
→ Typed Arrow builders (StringBuilder, Float64Builder, etc.)
→ Arrow RecordBatch
→ Output format (CSV / Feather / NDJSON / Parquet)
§Examples
§Inspect file metadata
Read metadata without loading any row data. Useful for discovering schema, row counts, variable types, and SAS format classifications.
use readstat::{ReadStatPath, ReadStatMetadata};
let rsp = ReadStatPath::new("data.sas7bdat".into())?;
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;
println!("Rows: {}, Variables: {}", md.row_count, md.var_count);
println!("Encoding: {}", md.file_encoding);
println!("Compression: {:?}", md.compression);
// Iterate over variable metadata
for (idx, var) in &md.vars {
    println!(
        "  [{idx}] {} ({:?}, format: {})",
        var.var_name, var.var_type_class, var.var_format
    );
}
// The Arrow schema is also available
println!("Schema: {:?}", md.schema);
§Read all data into Arrow RecordBatch
Parse the entire file into a single Arrow RecordBatch.
Best for smaller files that fit comfortably in memory.
use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};
let rsp = ReadStatPath::new("data.sas7bdat".into())?;
// Read metadata first
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;
// Read all rows into a single chunk
let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data(&rsp)?;
// Access the Arrow RecordBatch
if let Some(batch) = &d.batch {
    println!("Got {} rows x {} columns", batch.num_rows(), batch.num_columns());
    println!("Schema: {:?}", batch.schema());
}
§Stream data in chunks and write to Parquet
For large files, read in streaming chunks to control memory usage. Each chunk is parsed and written incrementally.
use readstat::{
    ReadStatPath, ReadStatMetadata, ReadStatData, ReadStatWriter,
    WriteConfig, OutFormat, build_offsets,
};
let rsp = ReadStatPath::new("data.sas7bdat".into())?;
let wc = WriteConfig::new(
    Some("output.parquet".into()),
    Some(OutFormat::Parquet),
    false, // overwrite
    None,  // compression (defaults to Snappy for Parquet)
    None,  // compression_level
)?;
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;
// Build chunk offsets: [0, 10000, 20000, ..., row_count]
let offsets = build_offsets(md.row_count as u32, 10_000)?;
let mut wtr = ReadStatWriter::new();
let pairs = offsets.windows(2);
let pairs_cnt = pairs.len();
for (i, w) in pairs.enumerate() {
    let mut d = ReadStatData::new().init(md.clone(), w[0], w[1]);
    d.read_data(&rsp)?;
    wtr.write(&d, &wc)?;
    if i == pairs_cnt - 1 {
        wtr.finish(&d, &wc, &rsp.path)?;
    }
}
§Read from in-memory bytes
Parse a .sas7bdat file from a byte slice instead of the filesystem.
Useful for cloud storage, HTTP uploads, WASM targets, and testing.
use readstat::{ReadStatMetadata, ReadStatData};
// sas_bytes: &[u8] — obtained from S3, HTTP, etc.
let mut md = ReadStatMetadata::new();
md.read_metadata_from_bytes(sas_bytes, false)?;
let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data_from_bytes(sas_bytes)?;
if let Some(batch) = &d.batch {
    println!("Parsed {} rows from bytes", batch.num_rows());
}
§Filter to specific columns
Select only specific columns before reading data. Unselected columns are skipped during parsing, reducing both memory and CPU usage.
use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};
use std::sync::Arc;
let rsp = ReadStatPath::new("data.sas7bdat".into())?;
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;
// Select only these columns
let columns = vec!["name".to_string(), "age".to_string()];
let filter = md.resolve_selected_columns(Some(columns))?;
if let Some(ref mapping) = filter {
    // Apply filter to metadata (updates schema and vars)
    let original_var_count = md.var_count;
    md = md.filter_to_selected_columns(mapping);
    let row_count = md.row_count as u32;
    let mut d = ReadStatData::new()
        .set_column_filter(Some(Arc::new(mapping.clone())), original_var_count)
        .init(md, 0, row_count);
    d.read_data(&rsp)?;
    if let Some(batch) = &d.batch {
        // batch only contains "name" and "age" columns
        println!(
            "Columns: {:?}",
            batch.schema().fields().iter().map(|f| f.name()).collect::<Vec<_>>()
        );
    }
}
§Convert RecordBatch to in-memory bytes
Serialize a parsed RecordBatch directly to
in-memory bytes without writing to a file. Useful for HTTP responses,
message queues, or piping to other Arrow-aware tools.
use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};
use readstat::write_batch_to_parquet_bytes;
use readstat::write_batch_to_csv_bytes;
let rsp = ReadStatPath::new("data.sas7bdat".into())?;
let mut md = ReadStatMetadata::new();
md.read_metadata(&rsp, false)?;
let row_count = md.row_count as u32;
let mut d = ReadStatData::new().init(md, 0, row_count);
d.read_data(&rsp)?;
if let Some(batch) = &d.batch {
    // Get Parquet bytes (e.g. for an HTTP response)
    let parquet_bytes = write_batch_to_parquet_bytes(batch)?;
    // Or CSV bytes
    let csv_bytes = write_batch_to_csv_bytes(batch)?;
}
§Key Types
- ReadStatPath: Validated input file path for SAS files
- WriteConfig: Output configuration (path, format, compression)
- ReadStatMetadata: File-level metadata (row/var counts, encoding, Arrow schema)
- ReadStatData: Parsed row data, convertible to Arrow RecordBatch
- ReadStatVarFormatClass: SAS format classification (Date, DateTime, Time variants)
- ReadStatWriter: Writes Arrow batches to the configured output format
§Features
Output format writers are feature-gated (all enabled by default except sql):
| Feature | Format | Notes |
|---|---|---|
| csv | CSV | Comma-separated values via arrow-csv |
| parquet | Parquet | Columnar format via the parquet crate, 5 compression codecs |
| feather | Feather | Arrow IPC format via arrow-ipc |
| ndjson | NDJSON | Newline-delimited JSON via arrow-json |
| sql | SQL | Query data with SQL via DataFusion (not enabled by default) |
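To trim the dependency tree, features can be selected in Cargo.toml. A sketch of the usual selection (the version shown is a placeholder, not a real release number):

```toml
[dependencies]
# Disable the default writers and keep only the Parquet one;
# sql must be requested explicitly since it is not on by default.
readstat = { version = "*", default-features = false, features = ["parquet", "sql"] }
```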
Modules§
- cb 🔒: FFI callback functions invoked by the ReadStat C library during parsing.
- common 🔒: Shared utility functions used across the crate.
- err 🔒: Error types for the readstat crate.
- formats 🔒: SAS format string classification using regex-based detection.
- progress 🔒: Progress reporting trait for parsing feedback.
- rs_buffer_io 🔒: Buffer-based I/O handlers for parsing SAS files from in-memory byte slices.
- rs_data 🔒: Data reading and Arrow RecordBatch conversion.
- rs_metadata 🔒: File-level and variable-level metadata extracted from .sas7bdat files.
- rs_parser 🔒: Safe wrapper around the ReadStat C parser.
- rs_path 🔒: Path validation for SAS file input.
- rs_var 🔒: Variable types and format classification for SAS data.
- rs_write 🔒: Output writers for converting Arrow RecordBatch data to CSV, Feather (Arrow IPC), NDJSON, or Parquet format.
- rs_write_config 🔒: Output configuration for writing Arrow data to various formats.
Structs§
- ReadStatData: Holds parsed row data from a .sas7bdat file and converts it to Arrow format.
- ReadStatMetadata: File-level metadata extracted from a .sas7bdat file.
- ReadStatPath: Validated file path for SAS file input.
- ReadStatVarMetadata: Metadata for a single variable (column) in a SAS dataset.
- ReadStatWriter: Manages writing Arrow [RecordBatch] data to the configured output format.
- WriteConfig: Output configuration for writing Arrow data.
Enums§
- OutFormat: Output file format for data conversion.
- ParquetCompression: Parquet compression algorithm.
- ReadStatCError: Error codes returned by the ReadStat C library.
- ReadStatCompress: Compression method used in a .sas7bdat file.
- ReadStatEndian: Byte order (endianness) of a .sas7bdat file.
- ReadStatError: The main error type for the readstat crate.
- ReadStatVarFormatClass: Semantic classification of a SAS format string.
- ReadStatVarType: The storage type of a SAS variable, as reported by the ReadStat C library.
- ReadStatVarTypeClass: High-level type class of a SAS variable: string or numeric.
Traits§
- ProgressCallback: Trait for receiving progress updates during data parsing.
Functions§
- build_offsets: Computes row offset boundaries for streaming chunk-based processing.
- write_batch_to_csv_bytes: Serialize a RecordBatch to CSV bytes (with header).
- write_batch_to_feather_bytes: Serialize a RecordBatch to Feather (Arrow IPC) bytes.
- write_batch_to_ndjson_bytes: Serialize a RecordBatch to NDJSON bytes.
- write_batch_to_parquet_bytes: Serialize a RecordBatch to Parquet bytes with Snappy compression.
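The streaming example above describes build_offsets as producing boundaries like [0, 10000, 20000, ..., row_count]. A rough sketch of that arithmetic (a hypothetical standalone reimplementation, not the crate's actual code):

```rust
// Hypothetical sketch of the boundary computation described for `build_offsets`:
// evenly spaced starts every `step` rows, with `row_count` as the final bound.
fn offsets(row_count: u32, step: u32) -> Vec<u32> {
    let mut bounds: Vec<u32> = (0..row_count).step_by(step as usize).collect();
    bounds.push(row_count); // the last (possibly partial) chunk ends at row_count
    bounds
}

fn main() {
    // 25_000 rows in chunks of 10_000 -> boundaries [0, 10000, 20000, 25000]
    assert_eq!(offsets(25_000, 10_000), vec![0, 10_000, 20_000, 25_000]);
    // Adjacent pairs are the (start, end) chunks, mirroring `offsets.windows(2)`
    for pair in offsets(25_000, 10_000).windows(2) {
        println!("chunk rows {}..{}", pair[0], pair[1]);
    }
}
```

Each adjacent pair of boundaries then maps to one `init(md, start, end)` call in the streaming loop.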