Skip to content

Query & AI

ByteFreezer stores data in Parquet format, enabling powerful querying through SQL engines and AI agents.

Query Options

Engine Best For Integration
DuckDB Fast SQL queries, embedded Built-in
AI Agents Natural language queries Built-in / BYOA
Trino/Presto Distributed queries External
ClickHouse Real-time analytics External
Grafana Dashboards & visualization Plugin

DuckDB Integration

DuckDB provides lightning-fast SQL queries directly against Parquet files in S3.

Why DuckDB?

  • Zero dependencies - Single binary, no cluster required
  • Direct S3 access - Query Parquet files in place
  • Blazing fast - Columnar engine optimized for analytics
  • SQL standard - Familiar syntax

Example Queries

-- Events by source in last 24 hours
SELECT source_ip, COUNT(*) as events
FROM read_parquet('s3://bucket/bytefreezer/*/tenant/dataset/year=2024/month=01/day=15/*/*.parquet')
GROUP BY source_ip
ORDER BY events DESC
LIMIT 10;

-- Failed logins by country
SELECT geo.country, COUNT(*) as failures
FROM read_parquet('s3://bucket/bytefreezer/**/*.parquet')
WHERE action = 'login' AND status = 'failed'
  AND timestamp >= '2024-01-15'
GROUP BY geo.country
ORDER BY failures DESC;

-- Time series aggregation
SELECT
  date_trunc('hour', timestamp) as hour,
  COUNT(*) as events,
  COUNT(DISTINCT user_id) as unique_users
FROM events
WHERE timestamp >= current_date - interval '7 days'
GROUP BY 1
ORDER BY 1;

AI Agents

Query your data using natural language with AI agents.

Built-in AI

ByteFreezer includes an AI query interface:

User: "Show me failed login attempts from Russia in the last week"

AI: Executing query...

Results:
| timestamp           | user        | source_ip    | country |
|---------------------|-------------|--------------|---------|
| 2024-01-15 10:23:45 | admin       | 185.x.x.x    | Russia  |
| 2024-01-15 11:45:12 | root        | 91.x.x.x     | Russia  |
| ...                 | ...         | ...          | ...     |

Found 47 failed login attempts from Russia in the last 7 days.

BYOA - Bring Your Own AI

For air-gapped or FedRAMP environments, plug in your own AI model:

ai:
  provider: custom
  endpoint: https://your-llm-endpoint.internal/v1/chat
  model: your-model-name

  # Or use local models
  # provider: ollama
  # endpoint: http://localhost:11434
  # model: llama2

Supported providers:

Provider Description
OpenAI GPT-4, GPT-3.5
Anthropic Claude
Azure OpenAI Azure-hosted OpenAI
Ollama Local models (Llama, Mistral)
Custom Any OpenAI-compatible API

How It Works

Natural Language Query
┌───────────────┐
│   AI Model    │  ──▶  Understands intent
└───────┬───────┘
┌───────────────┐
│ SQL Generator │  ──▶  Creates optimized query
└───────┬───────┘
┌───────────────┐
│    DuckDB     │  ──▶  Executes against Parquet
└───────┬───────┘
┌───────────────┐
│   AI Model    │  ──▶  Summarizes results
└───────┬───────┘
    Response

Grafana Integration

Connect Grafana for dashboards and alerting.

Using Trino Plugin

# Trino configuration for Grafana
datasource:
  type: trino
  url: http://trino:8080
  catalog: bytefreezer
  schema: events

Using JSON/Infinity Plugin

Query ByteFreezer's API directly:

datasource:
  type: yesoreyeram-infinity-datasource
  url: https://your-bytefreezer/api/v1/query

Query Best Practices

Use Partition Pruning

Always include time ranges to leverage partitioning:

-- Good: Uses partitions
SELECT * FROM events
WHERE timestamp >= '2024-01-15' AND timestamp < '2024-01-16'

-- Bad: Full scan
SELECT * FROM events
WHERE user = 'alice'

Select Only Needed Columns

Parquet is columnar - only selected columns are read:

-- Good: Reads only 2 columns
SELECT user, action FROM events

-- Bad: Reads all columns
SELECT * FROM events

Use Approximate Functions

For large datasets, use approximations:

-- Fast approximate count
SELECT approx_count_distinct(user_id) FROM events

-- Fast percentiles
SELECT approx_quantile(response_time, 0.95) FROM events

API Access

Query programmatically via the API:

curl -X POST https://your-bytefreezer/api/v1/query \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SELECT COUNT(*) FROM events WHERE timestamp >= current_date - 1",
    "format": "json"
  }'

Response:

{
  "columns": ["count"],
  "rows": [[1234567]],
  "execution_time_ms": 45
}

Air-Gapped Queries

For FedRAMP or air-gapped environments:

  1. Local DuckDB - No network required
  2. Local AI model - Ollama or similar
  3. No cloud dependencies - Query stays on-premises
  4. Your data, your control - Full sovereignty

See Control Deployment for setup.