
Storage

ByteFreezer stores processed data in Parquet format on S3-compatible object storage. This gives compact, low-cost, long-term storage that analytics engines can query efficiently.

Why Parquet?

Feature              Benefit
Columnar format      Query only the columns you need
Compression          Often around 10x smaller than raw JSON (varies with the data)
Schema evolution     Add fields without breaking queries
Predicate pushdown   Skip irrelevant data during queries
Industry standard    Supported by most analytics tools

Storage Architecture

S3/MinIO Bucket
└── bytefreezer/
    └── {account_id}/
        └── {tenant_id}/
            └── {dataset_id}/
                └── year={YYYY}/
                    └── month={MM}/
                        └── day={DD}/
                            └── hour={HH}/
                                ├── data_0001.parquet
                                ├── data_0002.parquet
                                └── data_0003.parquet

Auto-Partitioning

Data is automatically partitioned by:

  1. Account - Top-level isolation
  2. Tenant - Logical data grouping
  3. Dataset - Individual data stream
  4. Time - Year/Month/Day/Hour partitions

This enables efficient time-range queries and easy data lifecycle management.
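
The layout above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it assumes partitions are keyed by UTC time with zero-padded month, day, and hour values, which the source does not state explicitly.

```python
from datetime import datetime, timedelta, timezone


def partition_prefix(account_id: str, tenant_id: str, dataset_id: str, ts: datetime) -> str:
    """Build the hour-level key prefix for the layout shown above."""
    ts = ts.astimezone(timezone.utc)  # assumption: partitions use UTC
    return (
        f"bytefreezer/{account_id}/{tenant_id}/{dataset_id}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )


def hourly_prefixes(account_id, tenant_id, dataset_id, start, end):
    """List every hour-level prefix that overlaps the range [start, end)."""
    current = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(partition_prefix(account_id, tenant_id, dataset_id, current))
        current += timedelta(hours=1)
    return prefixes
```

A time-range query only needs to list the prefixes returned by hourly_prefixes, and deleting old data reduces to deleting whole prefixes. Pass timezone-aware datetimes; naive values would be interpreted as local time.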

BYOB - Bring Your Own Bucket

ByteFreezer supports using your own S3-compatible storage:

Provider               Supported
AWS S3                 Yes
MinIO                  Yes
Google Cloud Storage   Yes (S3 compatibility mode)
Azure Blob             Yes (S3 compatibility mode)
Backblaze B2           Yes
Wasabi                 Yes

Configuration

storage:
  type: s3
  bucket: my-bytefreezer-data
  region: us-east-1
  endpoint: ""  # Leave empty for AWS S3

  # For MinIO or other S3-compatible
  # endpoint: https://minio.example.com:9000

  credentials:
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
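
To make the role of each setting concrete, here is a minimal Python sketch of how such a config could be turned into S3 client arguments. It assumes the ${...} placeholders are expanded from environment variables and that an empty endpoint means the default AWS endpoint; resolve_storage_config is a hypothetical helper, not part of ByteFreezer. The returned keys match the keyword arguments of boto3.client("s3", ...).

```python
import os
from string import Template


def resolve_storage_config(storage: dict, env=os.environ) -> dict:
    """Expand ${VAR} placeholders and map the config onto S3 client arguments."""
    creds = storage.get("credentials", {})
    kwargs = {
        "region_name": storage["region"],
        # substitute() raises KeyError if a referenced variable is not set
        "aws_access_key_id": Template(creds["access_key"]).substitute(env),
        "aws_secret_access_key": Template(creds["secret_key"]).substitute(env),
    }
    endpoint = storage.get("endpoint") or None
    if endpoint:  # only set for MinIO or other S3-compatible services
        kwargs["endpoint_url"] = endpoint
    return kwargs
```

Failing fast on a missing variable is a deliberate choice in this sketch: a silently empty secret tends to surface later as a confusing authorization error.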

Schema Evolution

ByteFreezer handles changing data schemas gracefully:

Adding Fields

New fields are automatically added to the schema:

// Day 1
{"timestamp": "...", "user": "alice", "action": "login"}

// Day 2 - new field appears
{"timestamp": "...", "user": "bob", "action": "login", "mfa": true}

Parquet files written after the change include the new mfa column; files written earlier are left unchanged.

Missing Fields

Queries handle missing fields gracefully:

SELECT user, action, mfa
FROM events
WHERE date = '2024-01-15'
-- mfa will be NULL for events before the field was added
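
The behavior described in this section can be modeled in plain Python. This is a conceptual sketch, not ByteFreezer's implementation: the merged schema is treated as the union of all fields seen so far, and a field absent from a record reads as None, the counterpart of SQL NULL. The helper names are hypothetical.

```python
def merged_columns(records):
    """Return the union of field names, in first-seen order."""
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    return columns


def project(records, columns):
    """Return rows as tuples, using None for fields a record does not have."""
    return [tuple(record.get(name) for name in columns) for record in records]


day1 = {"timestamp": "...", "user": "alice", "action": "login"}
day2 = {"timestamp": "...", "user": "bob", "action": "login", "mfa": True}

columns = merged_columns([day1, day2])  # "mfa" is appended after the original fields
rows = project([day1, day2], ["user", "action", "mfa"])
```

Here rows holds one tuple per event, and the tuple for the Day 1 event has None in the mfa position.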

Data Lifecycle

Retention Policies

Configure retention per dataset:

datasets:
  - id: security-logs
    retention_days: 365  # Keep for 1 year

  - id: debug-logs
    retention_days: 7    # Keep for 1 week

  - id: audit-logs
    retention_days: 2555 # Keep for 7 years
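
As an illustration of what a retention setting implies for the day-level partitions, the following Python sketch decides whether a partition has aged out. The boundary rule (a partition expires once it is strictly older than retention_days) is an assumption, and is_expired is a hypothetical helper rather than a ByteFreezer API.

```python
from datetime import date, timedelta

# Values copied from the configuration above.
RETENTION_DAYS = {"security-logs": 365, "debug-logs": 7, "audit-logs": 2555}


def is_expired(dataset_id: str, partition_day: date, today: date) -> bool:
    """Return True when a day partition is older than the dataset's retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS[dataset_id])
    return partition_day < cutoff
```

Because data is partitioned by time, expiry can operate on whole day= prefixes instead of inspecting individual records.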

Tiered Storage

For long-term retention, use S3 lifecycle policies:

Age          Storage Class       Relative Cost
0-30 days    Standard            $$$
30-90 days   Infrequent Access   $$
90+ days     Glacier             $
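
On AWS S3, the tiers above map onto a bucket lifecycle configuration. The sketch below builds one as a Python dictionary in the shape accepted by the S3 PutBucketLifecycleConfiguration API (for example through boto3's put_bucket_lifecycle_configuration). The rule ID and the bytefreezer/ prefix are illustrative assumptions, and other S3-compatible providers may use different storage class names or not support transitions at all.

```python
import json


def tiering_rule(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build one lifecycle rule that moves objects to cheaper tiers as they age."""
    return {
        "ID": "bytefreezer-tiering",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle_configuration = {"Rules": [tiering_rule("bytefreezer/")]}
print(json.dumps(lifecycle_configuration, indent=2))
```

Objects in Glacier must be restored before they can be read, so queries over data older than 90 days would need a restore step under this policy.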

Compression

Parquet files use efficient compression:

Codec    Compression Ratio   Query Speed
Snappy   Good                Fast
Zstd     Better              Moderate
Gzip     Best                Slower

Default: Snappy, which balances compression ratio and query speed.

Cost Comparison

Storage costs compared to traditional SIEM:

Solution           Monthly Cost (1 TB/day, 90-day retention)   Annual Cost
Traditional SIEM   ~$100K+                                     ~$1.2M+
ByteFreezer + S3   ~$2K                                        ~$24K

Estimates based on public pricing. Your costs may vary.
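
One way to arrive at a storage figure of the same order is the back-of-the-envelope calculation below. It is a sketch under stated assumptions, not the source's own derivation: the per-GB price is an assumed S3 Standard list price that should be checked against current pricing, data is counted at its ingested size with no compression, and request, transfer, and compute costs are ignored.

```python
INGEST_TB_PER_DAY = 1
RETENTION_DAYS = 90
PRICE_PER_GB_MONTH = 0.023  # assumption: verify against current provider pricing

# Steady state: 90 days of data are stored at any given time.
stored_gb = INGEST_TB_PER_DAY * RETENTION_DAYS * 1000
monthly_cost = stored_gb * PRICE_PER_GB_MONTH
annual_cost = monthly_cost * 12

print(f"monthly ~${monthly_cost:,.0f}, annual ~${annual_cost:,.0f}")
```

Compression and tiered storage would lower the result, while request and query costs would raise it.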

Best Practices

Optimize Partitioning

  1. Use appropriate granularity - Hourly for high-volume, daily for low-volume
  2. Query with partition filters - Always include time ranges

Manage File Sizes

  1. Target 128MB-1GB files - Optimal for S3 and query engines
  2. Packer handles this - Automatic file sizing and compaction
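
To show what compaction toward a target size involves, here is a simple greedy grouping in Python. It is not Packer's actual algorithm, which the source does not describe; plan_compaction and the 128 MB threshold constant are illustrative assumptions.

```python
MB = 1024 * 1024
TARGET_MIN_BYTES = 128 * MB  # lower bound of the recommended file size


def plan_compaction(files):
    """Group small files, in order, into batches that reach the target minimum size.

    `files` is a list of (name, size_in_bytes) pairs from one partition.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if size >= TARGET_MIN_BYTES:
            continue  # already large enough, leave as-is
        current.append(name)
        current_size += size
        if current_size >= TARGET_MIN_BYTES:
            batches.append(current)
            current, current_size = [], 0
    if len(current) > 1:
        batches.append(current)  # merging leftovers still reduces the file count
    return batches
```

Each returned batch would be rewritten as a single Parquet file, which cuts the number of S3 requests a query engine has to make per partition.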

Secure Your Bucket

  1. Enable encryption - Server-side encryption (SSE-S3 or SSE-KMS)
  2. Restrict access - IAM policies for ByteFreezer only
  3. Enable versioning - Protect against accidental deletion
  4. Enable access logging - Audit bucket access

Air-Gapped / FedRAMP

For high-security environments:

  • On-premises MinIO - No cloud dependency
  • Network isolation - No internet required
  • Your encryption keys - Full control over data
  • BYOA - Bring Your Own AI model for queries

See Control Deployment for air-gapped setup.