pub struct ReadStatData {
    pub var_count: i32,
    pub vars: Arc<BTreeMap<i32, ReadStatVarMetadata>>,
    pub(crate) builders: Vec<ColumnBuilder>,
    pub schema: Arc<Schema>,
    pub batch: Option<RecordBatch>,
    pub chunk_rows_to_process: usize,
    pub(crate) chunk_row_start: usize,
    pub(crate) chunk_row_end: usize,
    pub(crate) chunk_rows_processed: usize,
    pub(crate) total_rows_processed: Option<Arc<AtomicUsize>>,
    pub(crate) progress: Option<Arc<dyn ProgressCallback>>,
    pub(crate) abort_error: Option<ReadStatError>,
    pub(crate) column_filter: Option<Arc<BTreeMap<i32, i32>>>,
    pub(crate) total_var_count: i32,
}
Holds parsed row data from a .sas7bdat file and converts it to Arrow format.
Each instance processes one streaming chunk of rows. Values are appended
directly into typed Arrow ColumnBuilders during the handle_value
callback, then finished into an Arrow [RecordBatch] via cols_to_batch.
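As a rough illustration of appending values directly into typed builders, the sketch below models the idea with plain standard-library vectors. DemoBuilder, DemoValue, and append are hypothetical stand-ins invented for this example; the crate's ColumnBuilder wraps Arrow builders, and its real interface is not shown on this page.

```rust
/// Toy stand-in for a typed column builder (hypothetical, std-only).
#[derive(Debug, PartialEq)]
enum DemoBuilder {
    Float(Vec<Option<f64>>),
    Text(Vec<Option<String>>),
}

/// Toy stand-in for one parsed cell (hypothetical).
#[derive(Debug, PartialEq)]
enum DemoValue {
    Float(f64),
    Text(String),
    Missing,
}

/// Appends one cell to its column's builder.
/// A value whose type does not match the builder is rejected and the
/// builder is left unchanged.
fn append(builder: &mut DemoBuilder, value: DemoValue) -> Result<(), String> {
    match (builder, value) {
        (DemoBuilder::Float(col), DemoValue::Float(v)) => col.push(Some(v)),
        (DemoBuilder::Text(col), DemoValue::Text(s)) => col.push(Some(s)),
        (DemoBuilder::Float(col), DemoValue::Missing) => col.push(None),
        (DemoBuilder::Text(col), DemoValue::Missing) => col.push(None),
        (b, v) => return Err(format!("type mismatch: {:?} into {:?}", v, b)),
    }
    Ok(())
}
```

Because each cell lands in its final column buffer as it arrives, no intermediate row representation has to be converted afterwards.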
Fields
var_count: i32
Number of variables (columns) in the dataset.
vars: Arc<BTreeMap<i32, ReadStatVarMetadata>>
Per-variable metadata, keyed by variable index.
Wrapped in Arc so parallel chunks share the same metadata without deep cloning.
builders: Vec<ColumnBuilder>
Typed Arrow builders, one per variable, pre-sized with capacity hints.
schema: Arc<Schema>
Arrow schema for the dataset.
Wrapped in Arc for cheap sharing across parallel chunks.
batch: Option<RecordBatch>
The Arrow RecordBatch produced after parsing, if available.
chunk_rows_to_process: usize
Number of rows to process in this chunk.
chunk_row_start: usize
Starting row offset for this chunk.
chunk_row_end: usize
Ending row offset (exclusive) for this chunk.
chunk_rows_processed: usize
Number of rows actually processed so far in this chunk.
total_rows_processed: Option<Arc<AtomicUsize>>
Shared atomic counter of total rows processed across all chunks.
progress: Option<Arc<dyn ProgressCallback>>
Optional progress callback for visual feedback during parsing.
abort_error: Option<ReadStatError>
A typed error raised by a value callback that aborted parsing.
Set by handle_value (e.g. on date/time overflow or a builder/value
type mismatch) and surfaced by the parse routines in preference to the
generic USER_ABORT the C library reports for any callback abort.
column_filter: Option<Arc<BTreeMap<i32, i32>>>
Optional mapping: original var index -> filtered column index.
Wrapped in Arc so parallel chunks share the same filter without deep cloning.
total_var_count: i32
Total variable count in the unfiltered dataset.
Used for row-boundary detection in handle_value when filtering is active.
Defaults to var_count when no filter is set.
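The precedence described for abort_error can be sketched as follows. This is only an illustration under assumptions: DemoError, DemoParser, and finish are hypothetical names, and the crate's ReadStatError variants and parse routines are not reproduced here.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum DemoError {
    /// Generic code reported for any callback abort.
    UserAbort,
    /// Typed error recorded by the value callback before aborting.
    DateOverflow(i64),
}

/// Hypothetical parser state holding the recorded typed error, if any.
struct DemoParser {
    abort_error: Option<DemoError>,
}

impl DemoParser {
    /// Maps the raw parse outcome to the result the caller sees.
    /// A recorded typed error wins over whatever the raw outcome was;
    /// `take()` also clears it so it is reported only once.
    fn finish(&mut self, raw: Result<(), DemoError>) -> Result<(), DemoError> {
        match self.abort_error.take() {
            Some(typed) => Err(typed),
            None => raw,
        }
    }
}
```

Storing the specific error on the struct lets the callback abort with a generic status while the caller still learns the actual cause.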
Implementations
impl ReadStatData
pub fn allocate_builders(self) -> Self
Allocates typed Arrow builders with capacity for chunk_rows_to_process.
Each builder’s type is determined by the variable metadata. String builders
are additionally pre-sized with storage_width * chunk_rows bytes.
The capacity hint is clamped to MAX_PREALLOC_ROWS because both the row
count and per-string storage_width originate from untrusted file headers;
a crafted file claiming billions of rows would otherwise trigger a multi-GB
up-front allocation (or a multiply overflow) before a single row is parsed.
Builders grow on demand, so clamping costs honest files nothing.
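A minimal sketch of the clamping idea, assuming a cap of one million rows. The constant's value here is hypothetical (the crate's actual MAX_PREALLOC_ROWS is not stated on this page), and row_capacity_hint and string_bytes_hint are illustrative helper names, not crate functions.

```rust
// Hypothetical cap for illustration only; the crate's real value may differ.
const MAX_PREALLOC_ROWS: usize = 1_000_000;

/// Row capacity hint for a builder, clamped so that a row count taken from
/// an untrusted header cannot force a huge up-front allocation.
fn row_capacity_hint(chunk_rows_to_process: usize) -> usize {
    chunk_rows_to_process.min(MAX_PREALLOC_ROWS)
}

/// Byte capacity hint for a string builder: storage_width * rows, using the
/// clamped row count and a saturating multiply so the product cannot wrap.
fn string_bytes_hint(storage_width: usize, chunk_rows_to_process: usize) -> usize {
    storage_width.saturating_mul(row_capacity_hint(chunk_rows_to_process))
}
```

For a small chunk the hint equals the real row count, so nothing changes; only an implausibly large header value is cut down to the cap.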
pub(crate) fn cols_to_batch(&mut self) -> Result<(), ReadStatError>
Finishes all builders and assembles the Arrow [RecordBatch].
Each builder produces its final array via finish(), which is an O(1)
operation (no data copying). The heavy work was already done during
handle_value when values were appended directly into the builders.
pub(crate) fn note_value(&mut self, var_index: i32)
Records that a value was observed for var_index during parsing.
When var_index is the dataset’s final variable, the cell marks the end
of a row, so the per-chunk and shared row counters are advanced. Boundary
detection uses total_var_count (the unfiltered variable count) so it
stays correct even when a column filter skips trailing columns.
Called from the value callback for both stored and filter-skipped cells, keeping row-boundary accounting in a single place.
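The row-boundary rule can be sketched like this. It assumes zero-based variable indices, so the final variable has index total_var_count - 1; that detail, along with the RowCounter type and count_rows helper, is an assumption made for the example and not taken from the crate's source.

```rust
/// Minimal stand-in for the row counters (hypothetical, not the crate's struct).
struct RowCounter {
    /// Variable count of the unfiltered dataset.
    total_var_count: i32,
    /// Rows completed so far in this chunk.
    chunk_rows_processed: usize,
}

impl RowCounter {
    /// Advances the row count when `var_index` is the last variable of a row.
    /// Assumes zero-based indices.
    fn note_value(&mut self, var_index: i32) {
        if var_index == self.total_var_count - 1 {
            self.chunk_rows_processed += 1;
        }
    }
}

/// Feeds `rows` full rows of cells through the counter, the way a value
/// callback would call it for every cell, stored or skipped.
fn count_rows(total_var_count: i32, rows: usize) -> usize {
    let mut counter = RowCounter { total_var_count, chunk_rows_processed: 0 };
    for _ in 0..rows {
        for var_index in 0..total_var_count {
            counter.note_value(var_index);
        }
    }
    counter.chunk_rows_processed
}
```

Calling the counter for every cell, including cells a filter discards, is what keeps the count independent of which columns are kept.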
pub fn read_data(&mut self, rsp: &ReadStatPath) -> Result<(), ReadStatError>
Parses row data from the file and converts it to an Arrow [RecordBatch].
Errors
Returns ReadStatError if FFI parsing or Arrow conversion fails.
pub fn read_data_from_bytes(&mut self, bytes: &[u8]) -> Result<(), ReadStatError>
Parses row data from an in-memory byte slice and converts it to an Arrow [RecordBatch].
Equivalent to read_data but reads from a &[u8]
buffer instead of a file path.
Errors
Returns ReadStatError if FFI parsing or Arrow conversion fails.
pub fn read_data_from_mmap(&mut self, path: &Path) -> Result<(), ReadStatError>
Parses row data from a memory-mapped .sas7bdat file and converts it to an Arrow [RecordBatch].
Opens the file at path and memory-maps it, avoiding explicit read syscalls.
This is especially beneficial for large files and for repeated chunk reads
against the same file, because the OS manages page caching automatically.
Safety
Memory mapping is safe as long as the file is not modified or truncated by another process while the map is active.
Errors
Returns ReadStatError if the file cannot be opened, mapped, or parsed.
pub(crate) fn parse_data(&mut self, rsp: &ReadStatPath) -> Result<(), ReadStatError>
Parses row data from the file via FFI callbacks (without Arrow conversion).
fn parse_data_from_bytes(&mut self, bytes: &[u8]) -> Result<(), ReadStatError>
pub fn init(self, md: ReadStatMetadata, row_start: u32, row_end: u32) -> Self
Initializes this instance with metadata and chunk boundaries, allocating builders.
Wraps vars and schema in Arc internally. For the parallel read path,
prefer init_shared which accepts pre-wrapped
Arcs to avoid repeated deep clones.
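When a file is read in several chunks, each instance needs its own row_start and row_end. The helper below is a hypothetical sketch of how such half-open ranges could be computed; chunk_bounds is not a crate function, and the crate may provide its own way to build chunk offsets.

```rust
/// Splits `row_count` rows into half-open `(row_start, row_end)` ranges of at
/// most `chunk_size` rows each. The end of each range is exclusive, matching
/// the chunk_row_end field description.
fn chunk_bounds(row_count: u32, chunk_size: u32) -> Vec<(u32, u32)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut bounds = Vec::new();
    let mut start = 0u32;
    while start < row_count {
        // saturating_add guards against overflow near u32::MAX.
        let end = start.saturating_add(chunk_size).min(row_count);
        bounds.push((start, end));
        start = end;
    }
    bounds
}
```

Each pair could then be passed as the row_start and row_end arguments of one initialization call, with the last chunk simply being shorter when the row count is not a multiple of the chunk size.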
pub fn init_filtered(
    self,
    md: ReadStatMetadata,
    mapping: &BTreeMap<i32, i32>,
    row_start: u32,
    row_end: u32,
) -> Self
Initializes this instance with a column filter applied, in one step.
Combines set_column_filter and
init in the correct order so callers cannot
accidentally invoke them the wrong way around (which would clobber the
original variable count needed for row-boundary detection).
md must be the original, unfiltered metadata and mapping the
result of ReadStatMetadata::resolve_selected_columns. The filtered
metadata and the original variable count are derived internally.
    use readstat::{ReadStatPath, ReadStatMetadata, ReadStatData};

    let rsp = ReadStatPath::new("data.sas7bdat")?;
    let mut md = ReadStatMetadata::new();
    md.read_metadata(&rsp, false)?;
    if let Some(mapping) = md.resolve_selected_columns(Some(vec!["name".into(), "age".into()]))? {
        let row_count = u32::try_from(md.row_count)?;
        let mut d = ReadStatData::new().init_filtered(md, &mapping, 0, row_count);
        d.read_data(&rsp)?;
    }
init_shared initializes this instance with pre-shared metadata and chunk boundaries.
Accepts Arc-wrapped vars and schema for cheap cloning in parallel loops.
Each call only increments reference counts (atomic +1) instead of deep-cloning
the entire metadata tree.
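The cost difference can be illustrated with the standard library alone. In this sketch the map values are plain strings standing in for ReadStatVarMetadata, and share_across_chunks is a hypothetical helper, not part of the crate.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Stand-in for per-variable metadata (the real type is ReadStatVarMetadata).
type VarName = String;

/// Hands one shared handle of the metadata map to each of `chunks` chunks.
/// `Arc::clone` copies a pointer and bumps a reference count; the map and
/// its entries are not copied.
fn share_across_chunks(
    vars: &Arc<BTreeMap<i32, VarName>>,
    chunks: usize,
) -> Vec<Arc<BTreeMap<i32, VarName>>> {
    (0..chunks).map(|_| Arc::clone(vars)).collect()
}
```

Every returned handle points at the same allocation, which Arc::ptr_eq can confirm, so the per-chunk cost stays constant regardless of how many variables the dataset has.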
fn set_chunk_counts(self, row_start: u32, row_end: u32) -> Self
fn set_metadata(self, md: ReadStatMetadata) -> Self
pub fn set_total_rows_processed(self, total_rows_processed: Arc<AtomicUsize>) -> Self
Sets the shared atomic counter for tracking rows processed across chunks.
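A shared counter of this kind might be used as in the sketch below, where each thread plays the role of one chunk. The process_chunks helper and the choice of Ordering::Relaxed are assumptions made for the example; the crate's actual threading model and memory ordering are not documented here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns one thread per chunk; each adds one to the shared counter per row.
/// Returns the total observed after all threads have been joined.
fn process_chunks(chunk_sizes: &[usize]) -> usize {
    let total = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = chunk_sizes
        .iter()
        .map(|&rows| {
            // Each chunk gets its own handle to the same counter.
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..rows {
                    total.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("chunk thread panicked");
    }
    // Joining the threads makes all of their increments visible here.
    total.load(Ordering::Relaxed)
}
```

An atomic counter avoids a mutex for what is a simple running total, which suits progress reporting where only the sum matters.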
pub fn set_column_filter(self, filter: Option<Arc<BTreeMap<i32, i32>>>, total_var_count: i32) -> Self
Sets the column filter and original (unfiltered) variable count.
Accepts an Arc-wrapped filter for cheap sharing across parallel chunks.
Must be called before init so that
total_var_count is preserved when set_metadata runs.
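To make the shape of the filter concrete, the sketch below builds a map from original variable index to filtered column index. It assumes filtered columns keep their dataset order, which this page does not state; build_filter is a hypothetical helper and is not how ReadStatMetadata::resolve_selected_columns is necessarily implemented.

```rust
use std::collections::BTreeMap;

/// Builds a map from original variable index to filtered column index for
/// the selected names. Assumes (for this sketch) that kept columns retain
/// their order in the dataset rather than the order of `selected`.
fn build_filter(all_vars: &[&str], selected: &[&str]) -> BTreeMap<i32, i32> {
    let mut mapping = BTreeMap::new();
    let mut next_column = 0i32;
    for (original_index, name) in all_vars.iter().enumerate() {
        if selected.contains(name) {
            mapping.insert(original_index as i32, next_column);
            next_column += 1;
        }
    }
    mapping
}
```

With four variables and two selected, the map has two entries while the unfiltered count remains four, which is why total_var_count has to be carried separately from the filter.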
pub fn set_progress(self, progress: Arc<dyn ProgressCallback>) -> Self
Attaches a progress callback for feedback during parsing.
The callback receives progress increments and parsing status updates.
See ProgressCallback for the required interface.
Trait Implementations
Auto Trait Implementations
impl Freeze for ReadStatData
impl !RefUnwindSafe for ReadStatData
impl Send for ReadStatData
impl Sync for ReadStatData
impl Unpin for ReadStatData
impl UnsafeUnpin for ReadStatData
impl !UnwindSafe for ReadStatData
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.