# Packer
The Packer reads processed data from S3 and converts it into Parquet files with automatic partitioning, schema evolution, and compression.
## What It Does
- Watches for processed data written by Piper
- Reads batches and infers schema from the data
- Converts to columnar Parquet format
- Partitions output by time hierarchy
- Writes compressed Parquet files to your S3/MinIO storage
## Partitioning
Output files are partitioned by a fixed hierarchy:
```
s3://bucket/parquet/{account_id}/{tenant_id}/{dataset_id}/year={YYYY}/month={MM}/day={DD}/hour={HH}/
```
This layout enables efficient time-range queries — query engines like DuckDB can skip irrelevant partitions entirely.
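Deriving the partition prefix for a record is a direct mapping from its UTC timestamp. A small sketch (function name and the example IDs are placeholders, not part of Packer's API):

```python
from datetime import datetime, timezone


def partition_prefix(account_id, tenant_id, dataset_id, ts):
    """Build the fixed-hierarchy partition prefix for an epoch timestamp."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (
        f"parquet/{account_id}/{tenant_id}/{dataset_id}/"
        f"year={dt:%Y}/month={dt:%m}/day={dt:%d}/hour={dt:%H}/"
    )


# Example: 2024-03-05 14:00 UTC
print(partition_prefix("acct1", "ten1", "ds1", 1709647200))
# -> parquet/acct1/ten1/ds1/year=2024/month=03/day=05/hour=14/
```

Because the `key=value` directory names follow the Hive convention, engines that understand Hive partitioning can turn a `WHERE year = 2024 AND month = 3` predicate into directory pruning instead of a full scan.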
## Schema Evolution
Packer handles changing data structures automatically:
- New fields are added to the Parquet schema as they appear
- Missing fields are written as null in the output
- Type conflicts are resolved by widening (e.g., int → string)
No manual schema management is required. As your data sources evolve, the Parquet output adapts.
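The three rules above can be modeled in a few lines. This is a toy model of the merge logic, not Packer's internals; the widening order shown (int → float → string) is an assumption beyond the int → string example given above.

```python
def widen(a, b):
    """Resolve a type conflict by widening (int -> float -> string)."""
    order = {"int": 0, "float": 1, "string": 2}
    return a if order[a] >= order[b] else b


def merge_schema(schema, record_types):
    """Fold one record's field types into the running schema.

    New fields are added; conflicting types are widened. Fields missing
    from the record stay in the schema and would be written as null.
    """
    merged = dict(schema)
    for field, typ in record_types.items():
        merged[field] = widen(merged[field], typ) if field in merged else typ
    return merged


s = {"ts": "int", "msg": "string"}
s = merge_schema(s, {"ts": "float", "level": "int"})
print(s)  # -> {'ts': 'float', 'msg': 'string', 'level': 'int'}
```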
## Compression
| Codec | Ratio | Speed | Use Case |
|---|---|---|---|
| Snappy | Good | Fast | Default for installer deployments |
| Zstd | Better | Moderate | Maximum compression |
## Configuration
```yaml
app:
  name: "bytefreezer-packer"

server:
  api_port: 8083

s3source:
  bucket_name: "bytefreezer-piper"   # Reads processed data from Piper
  region: "us-east-1"
  endpoint: "minio:9000"             # or s3.amazonaws.com
  ssl: false
  use_iam_role: false

control_service:
  enabled: true
  control_url: "https://api.bytefreezer.com"
  timeout_seconds: 30

parquet:
  max_file_size_mb: 128    # Target file size
  timeout_seconds: 300     # Max time before flushing
  compression: "snappy"    # snappy or zstd
  streaming_mode: true
  memory_buffer_mb: 64
  atomic_upload: true

health_reporting:
  enabled: true
  report_interval: 30
  register_on_startup: true
```
S3 credentials and the control API key are provided via the installer's `.env` file or Kubernetes Secrets. Packer writes to per-tenant S3 destinations configured via the Control API. See the installer project for deployment templates.
## Output Format
Files are written as standard Apache Parquet and can be read by any compatible tool: DuckDB, Spark, Pandas, Athena, BigQuery, etc.