Storage¶
ByteFreezer stores processed data in Parquet format on S3-compatible object storage. This provides efficient, cost-effective, long-term storage with excellent query performance.
Why Parquet?¶
| Feature | Benefit |
|---|---|
| Columnar format | Query only the columns you need |
| Compression | Typically ~10x smaller than equivalent JSON |
| Schema evolution | Add fields without breaking queries |
| Predicate pushdown | Skip irrelevant data during queries |
| Industry standard | Works with every analytics tool |
Storage Architecture¶
```text
S3/MinIO Bucket
└── bytefreezer/
    └── {account_id}/
        └── {tenant_id}/
            └── {dataset_id}/
                └── year={YYYY}/
                    └── month={MM}/
                        └── day={DD}/
                            └── hour={HH}/
                                ├── data_0001.parquet
                                ├── data_0002.parquet
                                └── data_0003.parquet
```
Auto-Partitioning¶
Data is automatically partitioned by:
- Account - Top-level isolation
- Tenant - Logical data grouping
- Dataset - Individual data stream
- Time - Year/Month/Day/Hour partitions
This enables efficient time-range queries and easy data lifecycle management.
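The partition layout above can be sketched as a small path-building helper. The function name and signature are illustrative, not ByteFreezer's actual code:

```python
# Sketch: derive the Hive-style partitioned object key for one
# Parquet file, mirroring the bucket layout shown above.
from datetime import datetime, timezone

def partition_key(account_id: str, tenant_id: str, dataset_id: str,
                  ts: datetime, file_seq: int) -> str:
    """Build the partition path for one Parquet file."""
    return (
        f"bytefreezer/{account_id}/{tenant_id}/{dataset_id}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"hour={ts.hour:02d}/data_{file_seq:04d}.parquet"
    )

ts = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(partition_key("acct1", "tenant1", "security-logs", ts, 1))
# bytefreezer/acct1/tenant1/security-logs/year=2024/month=01/day=15/hour=10/data_0001.parquet
```

Because the time fields are zero-padded path components, a query engine can prune whole year/month/day/hour prefixes before reading any file.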
BYOB - Bring Your Own Bucket¶
ByteFreezer supports using your own S3-compatible storage:
| Provider | Supported |
|---|---|
| AWS S3 | Yes |
| MinIO | Yes |
| Google Cloud Storage | Yes (S3 compatibility mode) |
| Azure Blob | Yes (S3 compatibility mode) |
| Backblaze B2 | Yes |
| Wasabi | Yes |
Configuration¶
```yaml
storage:
  type: s3
  bucket: my-bytefreezer-data
  region: us-east-1
  endpoint: ""  # Leave empty for AWS S3
  # For MinIO or other S3-compatible storage:
  # endpoint: https://minio.example.com:9000
  credentials:
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
```
Schema Evolution¶
ByteFreezer handles changing data schemas gracefully:
Adding Fields¶
New fields are automatically added to the schema:
```json
// Day 1
{"timestamp": "...", "user": "alice", "action": "login"}

// Day 2 - new field appears
{"timestamp": "...", "user": "bob", "action": "login", "mfa": true}
```
The Parquet files seamlessly accommodate the new `mfa` field.
Missing Fields¶
Queries handle missing fields gracefully:
```sql
SELECT user, action, mfa
FROM events
WHERE date = '2024-01-15'
-- mfa will be NULL for events before the field was added
```
Data Lifecycle¶
Retention Policies¶
Configure retention per dataset:
```yaml
datasets:
  - id: security-logs
    retention_days: 365   # Keep for 1 year
  - id: debug-logs
    retention_days: 7     # Keep for 1 week
  - id: audit-logs
    retention_days: 2555  # Keep for 7 years
```
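Because partitions encode their date in the object key, retention enforcement reduces to comparing the partition date against a cutoff. A sketch of that decision (the helper name and parsing are illustrative, not ByteFreezer's actual code):

```python
# Sketch: decide whether a day partition has aged out of a dataset's
# retention window, based on the partition layout shown earlier.
import re
from datetime import datetime, timedelta, timezone

def partition_expired(key: str, retention_days: int, now: datetime) -> bool:
    """Return True if the key's day partition is older than retention."""
    m = re.search(r"year=(\d{4})/month=(\d{2})/day=(\d{2})", key)
    if not m:
        return False  # not a partitioned key; never auto-delete
    day = datetime(int(m[1]), int(m[2]), int(m[3]), tzinfo=timezone.utc)
    return now - day > timedelta(days=retention_days)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
key = ("bytefreezer/acct1/t1/debug-logs/"
       "year=2024/month=01/day=15/hour=10/data_0001.parquet")
print(partition_expired(key, 7, now))    # True: far older than 7 days
print(partition_expired(key, 365, now))  # False: within one year
```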
Tiered Storage¶
For long-term retention, use S3 lifecycle policies:
| Age | Storage Class | Cost |
|---|---|---|
| 0-30 days | Standard | $$$ |
| 30-90 days | Infrequent Access | $$ |
| 90+ days | Glacier | $ |
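The tiering above maps directly to a standard S3 lifecycle configuration. A sketch of the rule (rule ID and prefix are illustrative; apply it with your provider's lifecycle API or console):

```json
{
  "Rules": [
    {
      "ID": "bytefreezer-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "bytefreezer/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```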
Compression¶
Parquet files use efficient compression:
| Codec | Compression Ratio | Query Speed |
|---|---|---|
| Snappy | Good | Fast |
| Zstd | Better | Moderate |
| Gzip | Best | Slower |
Default: Snappy, for a good balance of compression ratio and query speed.
Cost Comparison¶
Storage costs compared to traditional SIEM:
| Solution | 1 TB/day, 90-day retention | Annual Cost |
|---|---|---|
| Traditional SIEM | ~$100K+ | ~$1.2M+ |
| ByteFreezer + S3 | ~$2K | ~$24K |
Estimates based on public pricing. Your costs may vary.
Best Practices¶
Optimize Partitioning¶
- Use appropriate granularity - Hourly for high-volume, daily for low-volume
- Query with partition filters - Always include time ranges
Manage File Sizes¶
- Target 128MB-1GB files - Optimal for S3 and query engines
- Packer handles this - Automatic file sizing and compaction
Secure Your Bucket¶
- Enable encryption - Server-side encryption (SSE-S3 or SSE-KMS)
- Restrict access - IAM policies for ByteFreezer only
- Enable versioning - Protect against accidental deletion
- Enable access logging - Audit bucket access
Air-Gapped / FedRAMP¶
For high-security environments:
- On-premises MinIO - No cloud dependency
- Network isolation - No internet required
- Your encryption keys - Full control over data
- BYOA - Bring Your Own AI model for queries
See Control Deployment for air-gapped setup.