Storage

ByteFreezer stores processed data in Parquet format on S3-compatible object storage. This keeps long-term storage compact and inexpensive, and lets query engines read only the data they need.

Why Parquet?

Feature	Benefit
Columnar format	Query only the columns you need
Compression	Typically several times smaller than raw JSON (up to about 10x, depending on the data)
Schema evolution	Add fields without breaking queries
Predicate pushdown	Skip irrelevant data during queries
Industry standard	Readable by most analytics tools (for example Spark, Trino, DuckDB, pandas)

Storage Architecture

S3/MinIO Bucket
└── bytefreezer/
    └── {account_id}/
        └── {tenant_id}/
            └── {dataset_id}/
                └── year={YYYY}/
                    └── month={MM}/
                        └── day={DD}/
                            └── hour={HH}/
                                ├── data_0001.parquet
                                ├── data_0002.parquet
                                └── data_0003.parquet
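
As a minimal sketch of how the layout above maps to object keys, the following builds the prefix for one hourly partition. The function name `partition_prefix`, the sample IDs, and the use of UTC for partition boundaries are assumptions made for illustration; the source does not state which time zone ByteFreezer uses.

```python
from datetime import datetime, timezone


def partition_prefix(account_id: str, tenant_id: str, dataset_id: str, ts: datetime) -> str:
    """Return the object-key prefix of the hourly partition containing `ts`.

    Assumption: partitions are cut on UTC boundaries.
    """
    ts = ts.astimezone(timezone.utc)
    return (
        f"bytefreezer/{account_id}/{tenant_id}/{dataset_id}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )


# Hypothetical IDs, used only to show the shape of the result.
example_prefix = partition_prefix(
    "acct-1", "tenant-a", "security-logs",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(example_prefix)
```

The `key=value` directory names follow the Hive-style convention that many query engines recognize as partition columns.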

Auto-Partitioning

Data is automatically partitioned by:

Account - Top-level isolation
Tenant - Logical data grouping
Dataset - Individual data stream
Time - Year/Month/Day/Hour partitions

This enables efficient time-range queries and easy data lifecycle management.
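
To illustrate why time partitioning makes range queries cheap, here is a sketch that lists only the hourly prefixes a query for a given time window would need to read. `hourly_prefixes` is a hypothetical helper, and UTC partition boundaries are assumed.

```python
from datetime import datetime, timedelta, timezone


def hourly_prefixes(base: str, start: datetime, end: datetime) -> list[str]:
    """List the hour-partition prefixes under `base` that overlap [start, end)."""
    current = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = end.astimezone(timezone.utc)
    prefixes = []
    while current < end:
        prefixes.append(
            f"{base}year={current:%Y}/month={current:%m}/"
            f"day={current:%d}/hour={current:%H}/"
        )
        current += timedelta(hours=1)
    return prefixes


# Hypothetical dataset path; the window spans a day boundary.
window = hourly_prefixes(
    "bytefreezer/acct-1/tenant-a/security-logs/",
    datetime(2024, 1, 15, 22, 30, tzinfo=timezone.utc),
    datetime(2024, 1, 16, 1, 0, tzinfo=timezone.utc),
)
for p in window:
    print(p)
```

Everything outside these prefixes can be skipped without opening a single file. The same prefixes are also the natural unit for deleting or archiving old data.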

BYOB - Bring Your Own Bucket

ByteFreezer supports using your own S3-compatible storage:

Provider	Supported
AWS S3	Yes
MinIO	Yes
Google Cloud Storage	Yes (S3-compatible XML API with HMAC keys)
Azure Blob	Yes (through an S3-compatible gateway; Azure Blob has no native S3 API)
Backblaze B2	Yes
Wasabi	Yes

Configuration

storage:
  type: s3
  bucket: my-bytefreezer-data
  region: us-east-1
  endpoint: ""  # Leave empty for AWS S3

  # For MinIO or other S3-compatible
  # endpoint: https://minio.example.com:9000

  credentials:
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
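
The sketch below shows one way a client could interpret this configuration: an empty `endpoint` falls back to the regional AWS S3 endpoint, and `${VAR}` placeholders are filled from the environment. This is an assumption about how the settings are consumed, not ByteFreezer's actual loader; `resolve_storage` is a hypothetical name and the credentials are dummy values.

```python
import os
from string import Template
from typing import Mapping


def resolve_storage(storage: dict, env: Mapping[str, str] = os.environ) -> dict:
    """Resolve the effective endpoint and credentials for a `storage` section."""
    creds = storage["credentials"]
    return {
        "bucket": storage["bucket"],
        # Empty endpoint: assume the standard regional AWS S3 endpoint.
        "endpoint": storage.get("endpoint")
        or f"https://s3.{storage['region']}.amazonaws.com",
        # Template.substitute raises KeyError if a variable is not set.
        "access_key": Template(creds["access_key"]).substitute(env),
        "secret_key": Template(creds["secret_key"]).substitute(env),
    }


storage_cfg = {
    "type": "s3",
    "bucket": "my-bytefreezer-data",
    "region": "us-east-1",
    "endpoint": "",
    "credentials": {
        "access_key": "${AWS_ACCESS_KEY_ID}",
        "secret_key": "${AWS_SECRET_ACCESS_KEY}",
    },
}
fake_env = {"AWS_ACCESS_KEY_ID": "dummy-id", "AWS_SECRET_ACCESS_KEY": "dummy-secret"}

aws = resolve_storage(storage_cfg, fake_env)
minio = resolve_storage({**storage_cfg, "endpoint": "https://minio.example.com:9000"}, fake_env)
print(aws["endpoint"])
print(minio["endpoint"])
```

Keeping secrets in environment variables rather than in the file means the configuration can be committed to version control without exposing credentials.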

Schema Evolution

ByteFreezer handles changing data schemas gracefully:

Adding Fields

New fields are automatically added to the schema:

// Day 1
{"timestamp": "...", "user": "alice", "action": "login"}

// Day 2 - new field appears
{"timestamp": "...", "user": "bob", "action": "login", "mfa": true}

The Parquet files seamlessly accommodate the new mfa field.

Missing Fields

Queries handle missing fields gracefully:

SELECT user, action, mfa
FROM events
WHERE date = '2024-01-15'
-- mfa will be NULL for events before the field was added
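
The behavior described in this section can be modeled in a few lines of plain Python: the schema is the union of all fields seen so far, and a record that lacks a requested column yields `None` (the equivalent of SQL `NULL`). This is a conceptual sketch only; `merged_schema` and `project` are hypothetical helpers, not ByteFreezer functions.

```python
def merged_schema(records: list[dict]) -> list[str]:
    """Return the union of field names, in first-seen order."""
    fields: list[str] = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)
    return fields


def project(records: list[dict], columns: list[str]) -> list[tuple]:
    """Select `columns` from each record; missing fields become None."""
    return [tuple(record.get(col) for col in columns) for record in records]


events = [
    {"timestamp": "...", "user": "alice", "action": "login"},              # Day 1
    {"timestamp": "...", "user": "bob", "action": "login", "mfa": True},   # Day 2
]

schema = merged_schema(events)
rows = project(events, ["user", "action", "mfa"])
print(schema)
print(rows)
```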

Data Lifecycle

Retention Policies

Configure retention per dataset:

datasets:
  - id: security-logs
    retention_days: 365  # Keep for 1 year

  - id: debug-logs
    retention_days: 7    # Keep for 1 week

  - id: audit-logs
    retention_days: 2555 # Keep for 7 years
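
As a sketch of what a retention check amounts to, the function below decides whether a daily partition has aged out. The rule used here (a partition expires once it is strictly older than `retention_days`) is an assumption; the source does not specify the exact boundary behavior.

```python
from datetime import date, timedelta

# Mirrors the retention_days values from the configuration above.
RETENTION_DAYS = {"security-logs": 365, "debug-logs": 7, "audit-logs": 2555}


def is_expired(dataset_id: str, partition_day: date, today: date) -> bool:
    """Return True if the partition is older than the dataset's retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS[dataset_id])
    return partition_day < cutoff


today = date(2024, 1, 15)
partition = date(2024, 1, 1)
for dataset in RETENTION_DAYS:
    print(dataset, is_expired(dataset, partition, today))
```

Because partitions are laid out by day and hour, expiring data means deleting whole prefixes rather than rewriting files.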

Tiered Storage

For long-term retention, use S3 lifecycle policies to move older data to cheaper storage classes:

Age	Storage Class	Cost
0-30 days	Standard	$$$
30-90 days	Infrequent Access	$$
90+ days	Glacier	$
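
The tiers in this table correspond to an S3 lifecycle configuration along the following lines. The rule ID and the `bytefreezer/` prefix filter are assumptions for illustration; `STANDARD_IA` and `GLACIER` are the S3 storage class names for Infrequent Access and Glacier.

```python
import json


def lifecycle_config(prefix: str = "bytefreezer/") -> dict:
    """Build an S3 lifecycle configuration with 30-day and 90-day transitions."""
    return {
        "Rules": [
            {
                "ID": "bytefreezer-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = lifecycle_config()
print(json.dumps(config, indent=2))
```

With boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Objects moved to Glacier must be restored before they can be read, so queries over data older than 90 days would need a restore step first. Not every S3-compatible provider implements lifecycle transitions.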

Compression

Parquet supports several compression codecs:

Codec	Compression Ratio	Query Speed
Snappy	Good	Fast
Zstd	Better	Moderate
Gzip	Best	Slower

Default: Snappy, for a balance of compression and speed. Actual ratios and speeds depend on the data and the codec level.
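
The ratio-versus-speed trade-off can be observed with any codec. Snappy and Zstd are not part of Python's standard library, so this sketch uses `zlib` (the DEFLATE algorithm behind Gzip) at a fast and a thorough level as a stand-in. The sample records are synthetic, and the numbers say nothing about Parquet itself, which compresses column by column.

```python
import json
import time
import zlib

# Synthetic, repetitive log records (illustrative data only).
sample_rows = [
    {"timestamp": f"2024-01-15T09:{i % 60:02d}:00Z", "user": f"user{i % 50}", "action": "login"}
    for i in range(5000)
]
raw = "\n".join(json.dumps(row) for row in sample_rows).encode()


def measure(level: int) -> tuple[float, float]:
    """Compress `raw` at the given zlib level; return (ratio, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == raw  # round-trip check
    return len(raw) / len(packed), elapsed


for level in (1, 9):
    ratio, seconds = measure(level)
    print(f"zlib level {level}: {ratio:.1f}x smaller, {seconds * 1000:.2f} ms")
```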

Cost Comparison

Storage costs compared to a traditional SIEM:

Solution	Monthly cost (1 TB/day ingest, 90-day retention)	Annual cost
Traditional SIEM	~$100K+	~$1.2M+
ByteFreezer + S3	~$2K	~$24K

Estimates based on public pricing. Your costs may vary.
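
The ByteFreezer + S3 figures are roughly what you get from pricing 90 TB of steady-state storage (1 TB/day kept for 90 days) at standard object-storage rates. The calculation below assumes a list price of $0.023 per GB-month, which is an assumption rather than a number from this page, and it ignores compression, request charges, and data transfer.

```python
PRICE_PER_GB_MONTH = 0.023  # assumed list price in USD; check current pricing
GB_PER_TB = 1000


def s3_monthly_cost(ingest_tb_per_day: float, retention_days: int,
                    price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Monthly storage cost once the retention window is full."""
    stored_gb = ingest_tb_per_day * retention_days * GB_PER_TB
    return stored_gb * price_per_gb_month


monthly = s3_monthly_cost(1, 90)
annual = monthly * 12
print(f"monthly: ${monthly:,.0f}  annual: ${annual:,.0f}")
```

Under these assumptions the result is about $2,070 per month and $24,840 per year, in line with the table. Because Parquet is usually much smaller than the raw data, the stored volume, and therefore the bill, may be lower in practice.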

Best Practices

Optimize Partitioning

Use appropriate granularity - Hourly for high-volume datasets, daily for low-volume ones
Query with partition filters - Always include a time range so the engine can skip partitions outside it

Manage File Sizes

Target 128MB-1GB files - Optimal for S3 and query engines
Packer handles this - Automatic file sizing and compaction
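
To make the idea of compaction concrete, here is a generic greedy grouping that merges small files, in order, until each batch reaches a target size. The source does not describe Packer's actual algorithm, so this is only an illustration of the general technique; `plan_compaction` is a hypothetical name.

```python
def plan_compaction(file_sizes_mb: list[int], target_mb: int = 128) -> list[list[int]]:
    """Group file sizes, in order, into batches whose total reaches `target_mb`.

    The final batch may fall short of the target if files run out.
    """
    batches: list[list[int]] = []
    current: list[int] = []
    total = 0
    for size in file_sizes_mb:
        current.append(size)
        total += size
        if total >= target_mb:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)
    return batches


plan = plan_compaction([40, 50, 60, 30, 200, 10])
print(plan)
```

Each inner list represents the small files that would be rewritten as one larger Parquet file. Fewer, larger files mean fewer S3 requests and less per-file overhead for query engines.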

Secure Your Bucket

Enable encryption - Server-side encryption (SSE-S3 or SSE-KMS)
Restrict access - IAM policies for ByteFreezer only
Enable versioning - Protect against accidental deletion
Enable access logging - Audit bucket access
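
For the "restrict access" item, a least-privilege IAM policy might look like the sketch below. The set of actions ByteFreezer actually requires is an assumption here, as are the statement IDs and the `bytefreezer/` prefix; adjust them to your deployment.

```python
import json


def bucket_access_policy(bucket: str, prefix: str = "bytefreezer/") -> dict:
    """Build an IAM policy limited to one bucket and one key prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                # Assumed minimum set: read, write, and delete (for retention).
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = bucket_access_policy("my-bytefreezer-data")
print(json.dumps(policy, indent=2))
```

Attach a policy like this to the identity ByteFreezer runs as, and grant no other principal write access to the prefix.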

Air-Gapped Deployments

For high-security environments:

On-premises MinIO - No cloud dependency
Network isolation - No internet required
Your encryption keys - Full control over data
BYOA - Bring Your Own AI model for queries

See Control Deployment for air-gapped setup.