Service Inventory & Log Taxonomy
[[TOC]]
Overview
Total Services: 31 (sum of the per-category counts below)
Categories: 7
Current Log State: Mostly unstructured text
Target Log State: JSON-formatted with standard fields
Infrastructure (5 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| Traefik | Edge proxy, TLS termination | HTTP access logs, routing decisions, backend health | Request rate, latency p50/p95/p99, error rate |
| PostgreSQL | Primary database | Query logs, connections, replication, autovacuum | Active connections, query duration, deadlocks, replication lag |
| Redis | Cache & sessions | Command logs, evictions, persistence | Ops/sec, memory usage, eviction rate, hit ratio |
| RavenmaskOS DB | Custom OS database | Schema operations, queries | Query count, schema version |
| Redpanda | Event streaming | Broker logs, partition rebalance, consumer lag | Throughput MB/s, consumer lag, partition count |
Identity/Security (5 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| Zitadel | Identity provider | Auth events, token issuance, user operations | Auth success/fail rate, token issuance rate, active sessions |
| SPIRE Server | Workload identity | SVID issuance, attestation | Workload count, SVID TTL distribution, attestation failures |
| SPIRE Agent | Workload attestation | Local attestation, SVID delivery | Agent health, workload registrations |
| OpenFGA | Authorization | Relation checks, tuple writes | Check latency, tuple write rate, cache hit ratio |
| OAuth2 Proxy | SSO middleware | Session creation, upstream auth | Session count, auth failures, token refresh rate |
AI/Automation (5 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| n8n | Workflow automation | Execution logs, node failures, webhooks | Workflow success rate, execution duration, active workflows |
| Ollama | Local LLM inference | Model loads, inference requests, errors | Inference latency, queue depth, model memory usage |
| LangGraph | Agent framework | Execution logs, state transitions, tool calls | Execution count, avg duration, error rate, tool usage |
| Langfuse | LLM observability | Trace spans, token usage, costs | Token count, cost per request, trace duration |
| Norns | Multi-agent orchestration | Agent coordination, task distribution | Task success rate, agent invocations, response latency |
Voice (3 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| LiveKit | WebRTC infrastructure | Room events, track publications, TURN usage | Active rooms, participant count, bitrate, packet loss |
| Voice Gateway | Web voice via LiveKit + OpenAI Realtime | Session events, Norns delegations | Active sessions, response latency, error rate |
| Telephony | Phone voice via Twilio + Pipecat | Call events, Twilio webhooks, STT/TTS processing | Call count, caller identification rate, call duration |
DevOps (3 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| GitLab | Source control, CI/CD | Pipeline runs, merge events, webhooks | Pipeline success rate, job duration, runner utilization |
| Grafana | Observability UI | Dashboard loads, alert evaluations, queries | Active users, dashboard load time, alert eval latency |
| MCP GitLab | GitLab MCP server | Tool invocations, API requests | Request rate, error rate, latency |
Monitoring (7 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| Prometheus | Metrics collection | Scrape health, rule evaluation | Scrape duration, cardinality, rule eval time |
| Loki | Log aggregation | Ingestion, queries, compaction | Ingestion rate, query latency, storage size |
| Tempo | Distributed tracing | Trace ingestion, sampling | Trace ingestion rate, sampling ratio, query latency |
| Alloy | Telemetry pipeline | Receiver/processor/exporter logs | Pipeline lag, error rate, throughput |
| Uptime Kuma | Uptime monitoring | Probe results, notifications | Uptime %, probe latency, notification delivery |
| cAdvisor | Container metrics | Resource usage per container | CPU/memory per container |
| Node Exporter | Host metrics | System-level metrics | CPU, memory, disk, network |
Home Automation (3 services)
| Service | Purpose | Log Types | Key Metrics |
|---|---|---|---|
| Home Assistant | Home automation hub | Entity state changes, automation triggers | Active automations, entity count, event rate |
| Homebridge | HomeKit bridge | Accessory updates, HomeKit events | Accessory count, event rate |
| Grocy | Household inventory | Inventory changes, shopping list | Item count, expiring items |
Log Structure Requirements
Standard Fields (all services)
{
  "timestamp": "2026-01-02T23:38:53.524Z",
  "level": "info|warn|error|debug",
  "service": "traefik",
  "trace_id": "e81cfe01-66cf-4334-83b6-b5b7b27643d7",
  "message": "Backend health check passed",
  "context": {
    // Service-specific fields
  }
}
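As one way a Python-based service could emit the standard fields, the sketch below wraps the standard library `logging` module in a JSON formatter. The names `StandardJsonFormatter`, `build_logger`, and `LEVEL_MAP` are hypothetical, and passing `trace_id` and `context` through `extra=` is an assumption rather than an established convention of this stack.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python level names onto the four levels listed in the schema.
# Folding CRITICAL into "error" is an assumption; the schema has no fifth level.
LEVEL_MAP = {
    "DEBUG": "debug",
    "INFO": "info",
    "WARNING": "warn",
    "ERROR": "error",
    "CRITICAL": "error",
}


class StandardJsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the standard fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # UTC, millisecond precision, "Z" suffix as in the example above.
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_MAP.get(record.levelname, "info"),
            "service": self.service,
            # trace_id and context arrive via `extra=`; both default if absent.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Attach a stream handler that writes one JSON object per line."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(StandardJsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger


if __name__ == "__main__":
    log = build_logger("traefik")
    log.info(
        "Backend health check passed",
        extra={
            "trace_id": "e81cfe01-66cf-4334-83b6-b5b7b27643d7",
            "context": {"backend": "dashboard"},
        },
    )
```

Services that are not written in Python would need an equivalent in their own logging layer, or a transformation step in the telemetry pipeline.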
Service-Specific Context Examples
Traefik:
"context": {
  "method": "GET",
  "path": "/api/widgets/resources",
  "status": 304,
  "duration_ms": 32,
  "backend": "dashboard",
  "client_ip": "172.20.0.1"
}
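A minimal sketch of how this context could be derived, assuming Traefik is configured to write its access log in JSON and that the entry uses Traefik's documented field names (`RequestMethod`, `RequestPath`, `DownstreamStatus`, `Duration`, `ServiceName`, `ClientHost`). The function name `traefik_context` is hypothetical, and the field names should be checked against the deployed Traefik version.

```python
from typing import Any


def traefik_context(entry: dict[str, Any]) -> dict[str, Any]:
    """Map one parsed Traefik JSON access-log entry onto the context fields."""
    return {
        "method": entry.get("RequestMethod"),
        "path": entry.get("RequestPath"),
        "status": entry.get("DownstreamStatus"),
        # Traefik reports Duration in nanoseconds; convert to whole milliseconds.
        "duration_ms": round(entry.get("Duration", 0) / 1_000_000),
        "backend": entry.get("ServiceName"),
        "client_ip": entry.get("ClientHost"),
    }
```

Missing keys come back as `None` rather than raising, so a partially populated access-log entry still yields a usable context object.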
PostgreSQL:
"context": {
  "query": "SELECT * FROM users WHERE id = $1",
  "duration_ms": 12.4,
  "rows": 1,
  "connection_id": "5f3a2b1c"
}
LangGraph:
"context": {
  "agent_id": "sre-agent-v2.1.4",
  "execution_id": "exec_abc123",
  "state": "investigating",
  "tool_called": "query_loki",
  "duration_ms": 847
}
Telephony:
"context": {
  "call_sid": "CA1234567890abcdef",
  "caller_phone": "+15127812507",
  "user_id": "701973d2-57e4-4c84-a2ec-ded996dcf676",
  "display_name": "Nathan Walker",
  "event": "call_started"
}
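To check whether a service already meets the target log state, a log line can be tested against the standard fields. The sketch below is one possible check, not a defined part of this stack: `validate_log_line` is a hypothetical helper, and treating all six standard fields as required (including `trace_id`) is an assumption drawn from the "all services" heading.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id", "message", "context")
ALLOWED_LEVELS = {"info", "warn", "error", "debug"}


def validate_log_line(line: str) -> list[str]:
    """Return a list of problems; an empty list means the line conforms."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "level" in record and record["level"] not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {record['level']!r}")
    if "context" in record and not isinstance(record["context"], dict):
        problems.append("context is not an object")
    if "timestamp" in record:
        try:
            # Older Python versions reject a trailing "Z", so normalise it first.
            datetime.fromisoformat(str(record["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems
```

The check deliberately ignores the contents of `context`, since those fields differ per service.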
Next: [[AIOps-Observability-Stack]] - Alloy, Loki, Prometheus configuration