Query & AI¶
ByteFreezer stores data in Parquet format, enabling powerful querying through SQL engines and AI agents.
Query Options¶
| Engine | Best For | Integration |
|---|---|---|
| DuckDB | Fast SQL queries, embedded | Built-in |
| AI Agents | Natural language queries | Built-in / BYOA |
| Trino/Presto | Distributed queries | External |
| ClickHouse | Real-time analytics | External |
| Grafana | Dashboards & visualization | Plugin |
DuckDB Integration¶
DuckDB provides lightning-fast SQL queries directly against Parquet files in S3.
Why DuckDB?¶
- Zero dependencies - Single binary, no cluster required
- Direct S3 access - Query Parquet files in place
- Blazing fast - Columnar engine optimized for analytics
- SQL standard - Familiar syntax
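To query S3 directly, DuckDB needs its httpfs extension and S3 credentials. A minimal setup sketch (the bucket name, region, and credential values are placeholders; adjust for your environment):

-- Load S3 support and point DuckDB at your bucket (values below are placeholders)
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY';
SET s3_secret_access_key = 'YOUR_SECRET_KEY';

-- Sanity check: count all events in the archive
SELECT COUNT(*) as total_events
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet');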
Example Queries¶
-- Top sources by event count for a single day (2024-01-15)
SELECT source_ip, COUNT(*) as events
FROM read_parquet('s3://bucket/bytefreezer/*/tenant/dataset/year=2024/month=01/day=15/*/*.parquet')
GROUP BY source_ip
ORDER BY events DESC
LIMIT 10;
-- Failed logins by country
SELECT geo.country, COUNT(*) as failures
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet')
WHERE action = 'login' AND status = 'failed'
AND timestamp >= '2024-01-15'
GROUP BY geo.country
ORDER BY failures DESC;
-- Time series aggregation
SELECT
date_trunc('hour', timestamp) as hour,
COUNT(*) as events,
COUNT(DISTINCT user_id) as unique_users
FROM events
WHERE timestamp >= current_date - interval '7 days'
GROUP BY 1
ORDER BY 1;
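The time-series query above reads from a table named events rather than a raw Parquet path. One way to get that shorthand in DuckDB is to define a view over the Parquet files once per session; a sketch (the path is a placeholder):

-- Reusable view so later queries can simply say FROM events
CREATE OR REPLACE VIEW events AS
SELECT * FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet');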
AI Agents¶
Query your data using natural language with AI agents.
Built-in AI¶
ByteFreezer includes an AI query interface:
User: "Show me failed login attempts from Russia in the last week"
AI: Executing query...
Results:
| timestamp | user | source_ip | country |
|---------------------|-------------|--------------|---------|
| 2024-01-15 10:23:45 | admin | 185.x.x.x | Russia |
| 2024-01-15 11:45:12 | root | 91.x.x.x | Russia |
| ... | ... | ... | ... |
Found 47 failed login attempts from Russia in the last 7 days.
BYOA - Bring Your Own AI¶
For air-gapped or FedRAMP environments, plug in your own AI model:
ai:
  provider: custom
  endpoint: https://your-llm-endpoint.internal/v1/chat
  model: your-model-name
  # Or use local models:
  # provider: ollama
  # endpoint: http://localhost:11434
  # model: llama2
Supported providers:
| Provider | Description |
|---|---|
| OpenAI | GPT-4, GPT-3.5 |
| Anthropic | Claude |
| Azure OpenAI | Azure-hosted OpenAI |
| Ollama | Local models (Llama, Mistral) |
| Custom | Any OpenAI-compatible API |
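If you go with the Ollama option shown in the commented config above, it can help to confirm the local endpoint responds before wiring it into ByteFreezer. For example (the model name mirrors the comment; use whatever model you have pulled locally):

# Quick check that a local Ollama instance answers on the default port
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say hello", "stream": false}'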
How It Works¶
Natural Language Query
        │
        ▼
┌───────────────┐
│   AI Model    │ ──▶ Understands intent
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ SQL Generator │ ──▶ Creates optimized query
└───────┬───────┘
        │
        ▼
┌───────────────┐
│    DuckDB     │ ──▶ Executes against Parquet
└───────┬───────┘
        │
        ▼
┌───────────────┐
│   AI Model    │ ──▶ Summarizes results
└───────┬───────┘
        │
        ▼
    Response
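As a concrete illustration, the request from the transcript above might be turned into SQL along these lines. The exact query the generator produces will differ; column names follow the earlier examples:

-- Sketch of a generated query for "failed login attempts from Russia in the last week"
SELECT timestamp, user, source_ip, geo.country
FROM events
WHERE action = 'login'
  AND status = 'failed'
  AND geo.country = 'Russia'
  AND timestamp >= current_date - interval '7 days'
ORDER BY timestamp DESC;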
Grafana Integration¶
Connect Grafana for dashboards and alerting.
Using Trino Plugin¶
# Trino data source configuration for Grafana
datasource:
  type: trino
  url: http://trino:8080
  catalog: bytefreezer
  schema: events
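A dashboard panel can then run ordinary SQL against that catalog. A minimal time-series panel query might look like the following; table and column names are illustrative, and Grafana's time-range macros are omitted:

-- Events per minute for a time-series panel
SELECT date_trunc('minute', timestamp) as time, COUNT(*) as events
FROM events
WHERE timestamp >= current_timestamp - interval '6' hour
GROUP BY 1
ORDER BY 1;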
Using JSON/Infinity Plugin¶
Query ByteFreezer's API directly:
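For example, an Infinity (or generic JSON API) data source can be pointed at the query endpoint shown under API Access below. Field names vary by plugin version, so treat this as a sketch rather than exact provisioning syntax:

# Infinity/JSON data source pointed at the ByteFreezer query API (illustrative)
datasource:
  type: infinity
  url: https://your-bytefreezer/api/v1/query
  headers:
    Authorization: Bearer $TOKEN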
Query Best Practices¶
Use Partition Pruning¶
Always include time ranges to leverage partitioning:
-- Good: Uses partitions
SELECT * FROM events
WHERE timestamp >= '2024-01-15' AND timestamp < '2024-01-16'
-- Bad: Full scan
SELECT * FROM events
WHERE user = 'alice'
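When querying raw Parquet paths instead of a pre-built table, DuckDB can also expose the year=/month=/day= path segments as columns via hive partitioning, so a filter on them skips whole directories. A sketch (the bucket path is a placeholder; depending on your DuckDB version the partition columns may come back as strings, in which case compare against '2024', '01', '15' instead):

-- Prune by partition columns derived from the directory layout
SELECT COUNT(*) as events
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet', hive_partitioning = true)
WHERE year = 2024 AND month = 1 AND day = 15;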
Select Only Needed Columns¶
Parquet is columnar - only selected columns are read:
-- Good: Reads only 2 columns
SELECT user, action FROM events
-- Bad: Reads all columns
SELECT * FROM events
Use Approximate Functions¶
For large datasets, use approximations:
-- Fast approximate count
SELECT approx_count_distinct(user_id) FROM events
-- Fast percentiles
SELECT approx_quantile(response_time, 0.95) FROM events
API Access¶
Query programmatically via the API:
curl -X POST https://your-bytefreezer/api/v1/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SELECT COUNT(*) FROM events WHERE timestamp >= current_date - 1",
    "format": "json"
  }'
The query results come back in the requested format (JSON in this example).
Air-Gapped Queries¶
For FedRAMP or air-gapped environments:
- Local DuckDB - No network required
- Local AI model - Ollama or similar
- No cloud dependencies - Query stays on-premises
- Your data, your control - Full sovereignty
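For example, a fully offline check can run DuckDB against a locally mirrored copy of the archive instead of S3 (the local path is a placeholder):

-- Query a local mirror of the archive; no network access required
SELECT COUNT(*) as events
FROM read_parquet('/data/bytefreezer/**/*.parquet');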
See Control Deployment for setup.