Packer

The Packer reads processed data from S3 and converts it into Parquet files with automatic partitioning, schema evolution, and compression.

What It Does

  1. Watches for processed data written by Piper
  2. Reads batches and infers schema from the data
  3. Converts to columnar Parquet format
  4. Partitions output by time hierarchy
  5. Writes compressed Parquet files to your S3/MinIO storage

Partitioning

Output files are partitioned by a fixed hierarchy:

s3://bucket/parquet/{account_id}/{tenant_id}/{dataset_id}/year={YYYY}/month={MM}/day={DD}/hour={HH}/

This layout enables efficient time-range queries — query engines like DuckDB can skip irrelevant partitions entirely.

Schema Evolution

Packer handles changing data structures automatically:

  • New fields are added to the Parquet schema as they appear
  • Missing fields are written as null in the output
  • Type conflicts are resolved by widening (e.g., int → string)

No manual schema management is required. As your data sources evolve, the Parquet output adapts.

Compression

  Codec    Ratio    Speed      Use Case
  Snappy   Good     Fast       Default for installer deployments
  Zstd     Better   Moderate   Maximum compression

Configuration

app:
  name: "bytefreezer-packer"

server:
  api_port: 8083

s3source:
  bucket_name: "bytefreezer-piper"         # Reads processed data from Piper
  region: "us-east-1"
  endpoint: "minio:9000"                   # or s3.amazonaws.com
  ssl: false
  use_iam_role: false

control_service:
  enabled: true
  control_url: "https://api.bytefreezer.com"
  timeout_seconds: 30

parquet:
  max_file_size_mb: 128                    # Target file size
  timeout_seconds: 300                     # Max time before flushing
  compression: "snappy"                    # snappy or zstd
  streaming_mode: true
  memory_buffer_mb: 64
  atomic_upload: true

health_reporting:
  enabled: true
  report_interval: 30
  register_on_startup: true

S3 credentials and the control API key are provided via the installer's .env file or Kubernetes Secrets. Packer writes to per-tenant S3 destinations configured via the Control API. See the installer project for deployment templates.

Output Format

Files are written as standard Apache Parquet and can be read by any compatible tool: DuckDB, Spark, Pandas, Athena, BigQuery, etc.