Packer

The Packer reads processed data from S3 and converts it into Parquet files with automatic partitioning, schema evolution, and compression.

What It Does

  1. Watches for processed data written by Piper
  2. Reads batches and infers schema from the data
  3. Converts to columnar Parquet format
  4. Partitions output by time hierarchy
  5. Writes compressed Parquet files to your S3/MinIO storage

Partitioning

Output files are partitioned by a fixed hierarchy:

s3://bucket/parquet/{account_id}/{tenant_id}/{dataset_id}/year={YYYY}/month={MM}/day={DD}/hour={HH}/

This layout enables efficient time-range queries — query engines like DuckDB can skip irrelevant partitions entirely.

Schema Evolution

Packer handles changing data structures automatically:

  • New fields are added to the Parquet schema as they appear
  • Missing fields are written as null in the output
  • Type conflicts are resolved by widening (e.g., int → string)

No manual schema management is required. As your data sources evolve, the Parquet output adapts.

Compression

  Codec    Ratio    Speed      Use Case
  Snappy   Good     Fast       Default for installer deployments
  Zstd     Better   Moderate   Maximum compression

Configuration

app:
  name: "bytefreezer-packer"

server:
  api_port: 8083

s3source:
  bucket_name: "bytefreezer-piper"         # Reads processed data from Piper
  region: "us-east-1"
  endpoint: "minio:9000"                   # or s3.amazonaws.com
  ssl: false
  use_iam_role: false

control_service:
  enabled: true
  control_url: "https://api.bytefreezer.com"
  timeout_seconds: 30

parquet:
  max_file_size_mb: 128                    # Target file size
  timeout_seconds: 300                     # Max time before flushing
  compression: "snappy"                    # snappy or zstd
  streaming_mode: true
  memory_buffer_mb: 64
  atomic_upload: true

health_reporting:
  enabled: true
  report_interval: 30
  register_on_startup: true

S3 credentials and the control API key are provided via the installer's .env file or Kubernetes Secrets. Packer writes to per-tenant S3 destinations configured via the Control API. See the installer project for deployment templates.

Output Format

Files are written as standard Apache Parquet and can be read by any compatible tool: DuckDB, Spark, Pandas, Athena, BigQuery, etc.