Query & AI¶
ByteFreezer stores data in Parquet format, enabling powerful querying through SQL engines and AI agents.
Query Options¶
| Engine | Best For | Integration |
|---|---|---|
| DuckDB | Fast SQL queries, embedded | Built-in |
| AI Agents | Natural language queries | Built-in / BYOA |
| Trino/Presto | Distributed queries | External |
| ClickHouse | Real-time analytics | External |
| Grafana | Dashboards & visualization | Plugin |
DuckDB Integration¶
DuckDB provides lightning-fast SQL queries directly against Parquet files in S3.
Why DuckDB?¶
- Zero dependencies - Single binary, no cluster required
- Direct S3 access - Query Parquet files in place
- Blazing fast - Columnar engine optimized for analytics
- SQL standard - Familiar syntax
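To query S3 directly, DuckDB needs its httpfs extension and S3 credentials. A minimal setup sketch (the bucket name, region, and credential values are placeholders; adjust for your environment):

-- Load S3 support and point DuckDB at your bucket (values below are placeholders)
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';

-- Sanity check: count all events in the archive
SELECT COUNT(*) as total_events
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet');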
Example Queries¶
-- Top sources by event count for a single day (2024-01-15)
SELECT source_ip, COUNT(*) as events
FROM read_parquet('s3://bucket/bytefreezer/*/tenant/dataset/year=2024/month=01/day=15/*/*.parquet')
GROUP BY source_ip
ORDER BY events DESC
LIMIT 10;
-- Failed logins by country
SELECT geo.country, COUNT(*) as failures
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet')
WHERE action = 'login' AND status = 'failed'
AND timestamp >= '2024-01-15'
GROUP BY geo.country
ORDER BY failures DESC;
-- Time series aggregation
SELECT
date_trunc('hour', timestamp) as hour,
COUNT(*) as events,
COUNT(DISTINCT user_id) as unique_users
FROM events
WHERE timestamp >= current_date - interval '7 days'
GROUP BY 1
ORDER BY 1;
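The time-series query above reads from a table named events rather than a raw Parquet path. One way to get that shorthand in DuckDB is to define a view over the Parquet files once per session; a sketch (the path is a placeholder):

-- Reusable view so later queries can simply say FROM events
CREATE OR REPLACE VIEW events AS
SELECT * FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet');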
AI Agents¶
Query your data using natural language with AI agents.
Built-in AI¶
ByteFreezer includes an AI query interface:
User: "Show me failed login attempts from Russia in the last week"
AI: Executing query...
Results:
| timestamp | user | source_ip | country |
|---------------------|-------------|--------------|---------|
| 2024-01-15 10:23:45 | admin | 185.x.x.x | Russia |
| 2024-01-15 11:45:12 | root | 91.x.x.x | Russia |
| ... | ... | ... | ... |
Found 47 failed login attempts from Russia in the last 7 days.
BYOA - Bring Your Own AI¶
For air-gapped or FedRAMP environments, plug in your own AI model:
ai:
  provider: custom
  endpoint: https://your-llm-endpoint.internal/v1/chat
  model: your-model-name
  # Or use local models:
  # provider: ollama
  # endpoint: http://localhost:11434
  # model: llama2
Supported providers:
| Provider | Description |
|---|---|
| OpenAI | GPT-4, GPT-3.5 |
| Anthropic | Claude |
| Azure OpenAI | Azure-hosted OpenAI |
| Ollama | Local models (Llama, Mistral) |
| Custom | Any OpenAI-compatible API |
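If you go with the Ollama option shown in the commented config above, it can help to confirm the local endpoint responds before wiring it into ByteFreezer. For example (the model name mirrors the comment; use whatever model you have pulled locally):

# Quick check that a local Ollama instance answers on the default port
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say hello", "stream": false}'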
How It Works¶
Natural Language Query
        │
        ▼
┌───────────────┐
│   AI Model    │ ──▶ Understands intent
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ SQL Generator │ ──▶ Creates optimized query
└───────┬───────┘
        │
        ▼
┌───────────────┐
│    DuckDB     │ ──▶ Executes against Parquet
└───────┬───────┘
        │
        ▼
┌───────────────┐
│   AI Model    │ ──▶ Summarizes results
└───────┬───────┘
        │
        ▼
    Response
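As a concrete illustration, the request from the transcript above might be turned into SQL along these lines. The exact query the generator produces will differ; column names follow the earlier examples:

-- Sketch of a generated query for "failed login attempts from Russia in the last week"
SELECT timestamp, user, source_ip, geo.country
FROM events
WHERE action = 'login'
  AND status = 'failed'
  AND geo.country = 'Russia'
  AND timestamp >= current_date - interval '7 days'
ORDER BY timestamp DESC;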
Grafana Integration¶
Connect Grafana for dashboards and alerting.
Using Trino Plugin¶
# Trino data source configuration for Grafana
datasource:
  type: trino
  url: http://trino:8080
  catalog: bytefreezer
  schema: events
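A dashboard panel can then run ordinary SQL against that catalog. A minimal time-series panel query might look like the following; table and column names are illustrative, and Grafana's time-range macros are omitted:

-- Events per minute for a time-series panel
SELECT date_trunc('minute', timestamp) as time, COUNT(*) as events
FROM events
WHERE timestamp >= current_timestamp - interval '6' hour
GROUP BY 1
ORDER BY 1;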
Using JSON/Infinity Plugin¶
Query ByteFreezer's API directly:
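For example, an Infinity (or generic JSON API) data source can be pointed at the query endpoint shown under API Access below. Field names vary by plugin version, so treat this as a sketch rather than exact provisioning syntax:

# Infinity/JSON data source pointed at the ByteFreezer query API (illustrative)
datasource:
  type: infinity
  url: https://your-bytefreezer/api/v1/query
  headers:
    Authorization: Bearer $TOKEN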
Query Best Practices¶
Use Partition Pruning¶
Always include time ranges to leverage partitioning:
-- Good: Uses partitions
SELECT * FROM events
WHERE timestamp >= '2024-01-15' AND timestamp < '2024-01-16'
-- Bad: Full scan
SELECT * FROM events
WHERE user = 'alice'
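When querying raw Parquet paths instead of a pre-built table, DuckDB can also expose the year=/month=/day= path segments as columns via hive partitioning, so a filter on them skips whole directories. A sketch (the bucket path is a placeholder; depending on your DuckDB version the partition columns may come back as strings, in which case compare against '2024', '01', '15' instead):

-- Prune by partition columns derived from the directory layout
SELECT COUNT(*) as events
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet', hive_partitioning = true)
WHERE year = 2024 AND month = 1 AND day = 15;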
Select Only Needed Columns¶
Parquet is columnar - only selected columns are read:
-- Good: Reads only 2 columns
SELECT user, action FROM events
-- Bad: Reads all columns
SELECT * FROM events
Use Approximate Functions¶
For large datasets, use approximations:
-- Fast approximate count
SELECT approx_count_distinct(user_id) FROM events
-- Fast percentiles
SELECT approx_quantile(response_time, 0.95) FROM events
API Access¶
Query programmatically via the API:
curl -X POST https://your-bytefreezer/api/v1/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SELECT COUNT(*) FROM events WHERE timestamp >= current_date - 1",
    "format": "json"
  }'
The query results come back in the requested format (JSON in this example).
Air-Gapped Queries¶
For FedRAMP or air-gapped environments:
- Local DuckDB - No network required
- Local AI model - Ollama or similar
- No cloud dependencies - Query stays on-premises
- Your data, your control - Full sovereignty
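For example, a fully offline check can run DuckDB against a locally mirrored copy of the archive instead of S3 (the local path is a placeholder):

-- Query a local mirror of the archive; no network access required
SELECT COUNT(*) as events
FROM read_parquet('/data/bytefreezer/**/*.parquet');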
See Control Deployment for setup.