Observability Overview

Optimal Platform includes an observability stack that covers metrics (Prometheus), logs (Loki and Promtail), and dashboards and alerting (Grafana).

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                             OBSERVABILITY STACK                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Applications   │  │   Kubernetes    │  │ Infrastructure  │              │
│  │   (Services)    │  │  (Nodes/Pods)   │  │     (Cloud)     │              │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘              │
│           │                    │                    │                       │
│           ▼                    ▼                    ▼                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                           Collection Layer                           │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                   │   │
│  │  │ Prometheus  │  │  Promtail   │  │    Falco    │                   │   │
│  │  │  (Metrics)  │  │   (Logs)    │  │  (Events)   │                   │   │
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                   │   │
│  └─────────┼────────────────┼────────────────┼──────────────────────────┘   │
│            │                │                │                              │
│            ▼                ▼                ▼                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                            Storage Layer                             │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                   │   │
│  │  │ Prometheus  │  │    Loki     │  │  Postgres   │                   │   │
│  │  │   (TSDB)    │  │ (Log Store) │  │  (Events)   │                   │   │
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                   │   │
│  └─────────┼────────────────┼────────────────┼──────────────────────────┘   │
│            │                │                │                              │
│            └────────────────┼────────────────┘                              │
│                             ▼                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         Visualization Layer                          │   │
│  │                 ┌─────────────────┐                                  │   │
│  │                 │     Grafana     │                                  │   │
│  │                 │  (Dashboards)   │                                  │   │
│  │                 └─────────────────┘                                  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Components

Prometheus (Metrics)

Prometheus collects and stores time-series metrics from all platform components.

Default Scrape Targets:

  • Kubernetes API server
  • Kubernetes nodes (kubelet)
  • All platform services
  • PostgreSQL exporter
  • Redis exporter

Configuration:

prometheus:
  enabled: true
  retention: 15d
  scrapeInterval: 15s
  evaluationInterval: 15s
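
For orientation, the interval settings above correspond to fields in a plain upstream Prometheus configuration file. The sketch below is an assumption about how such values might be rendered, not the platform's actual template, and the job name is hypothetical. Note that retention is not a configuration-file field in upstream Prometheus; it is normally passed as a command-line flag.

```yaml
# Sketch of an equivalent upstream prometheus.yml
# (assumed mapping, not the platform's real template)
global:
  scrape_interval: 15s       # corresponds to scrapeInterval
  evaluation_interval: 15s   # corresponds to evaluationInterval

scrape_configs:
  # Hypothetical job covering the "Kubernetes nodes (kubelet)" target
  - job_name: kubernetes-nodes
    scheme: https
    kubernetes_sd_configs:
      - role: node           # discover one target per cluster node
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# retention: 15d has no config-file equivalent; upstream Prometheus takes it
# as the flag --storage.tsdb.retention.time=15d
```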

Key Metrics:

Metric                           Description
http_requests_total              Total HTTP requests
http_request_duration_seconds    Request latency
sbom_scans_total                 Total SBOM scans
vulnerabilities_detected         Vulnerabilities found
policy_violations_total          Kyverno policy violations
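
As an illustration of how these metrics might be queried, the following Prometheus rule file precomputes a few aggregates as recording rules. It is a sketch under assumptions: the group and rule names are hypothetical, and it assumes the three _total metrics are counters, which their names suggest but the platform does not confirm here.

```yaml
groups:
  - name: optimal-platform-examples   # hypothetical group name
    rules:
      # Per-job HTTP request rate (requests per second) over the last 5 minutes
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # SBOM scans counted over the last hour
      - record: job:sbom_scans:increase1h
        expr: sum by (job) (increase(sbom_scans_total[1h]))

      # Kyverno policy violations counted over the last 24 hours
      - record: job:policy_violations:increase24h
        expr: sum by (job) (increase(policy_violations_total[24h]))
```

The same expressions can be pasted into the Prometheus UI or a Grafana panel without the surrounding rule-file structure.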

Loki (Logging)

Loki provides log aggregation with Grafana integration.

Components:

  • Loki Server: Log storage and querying
  • Promtail: Log collection agent (DaemonSet)

Log Sources:

  • Container logs (/var/log/pods)
  • Docker daemon logs
  • System logs

Configuration:

loki:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi

promtail:
  enabled: true
  config:
    positions:
      filename: /run/promtail/positions.yaml
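
To make the collection path concrete, here is a sketch of a native Promtail configuration that tails container logs from /var/log/pods and ships them to Loki. It is not the platform's rendered configuration: the Loki service address is an assumption, and the namespace and app labels are chosen only because the log queries later on this page select on them.

```yaml
positions:
  filename: /run/promtail/positions.yaml   # where Promtail records read offsets

clients:
  # Assumed in-cluster Loki address; 3100 is Loki's default HTTP port
  - url: http://loki.monitoring.svc:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                          # discover pods through the Kubernetes API
    pipeline_stages:
      - cri: {}                            # parse CRI-formatted container log lines
    relabel_configs:
      # Expose the pod namespace as the "namespace" label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Expose the pod's "app" label as the "app" label
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # Build the log file path under /var/log/pods for each container
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```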

Grafana (Visualization)

Grafana provides dashboards for visualizing metrics and logs.

Pre-configured Dashboards:

  • Platform Overview
  • Service Health
  • Kubernetes Cluster
  • Security Events
  • Vulnerability Trends

Data Sources:

  • Prometheus (metrics)
  • Loki (logs)
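
The two data sources could be declared with a Grafana provisioning file along the following lines. This is a minimal sketch, not the platform's shipped file: the Prometheus URL is inferred from the prometheus service on port 9090 in the monitoring namespace used elsewhere on this page, and the Loki service name is an assumption.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                               # Grafana backend proxies the queries
    url: http://prometheus.monitoring.svc:9090  # inferred from svc/prometheus in namespace monitoring
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100        # assumed service name; 3100 is Loki's default port
```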

Access:

URL: https://observability.gooptimal.io
Default credentials: username admin; the password is read from a secret

Alerting

Alert Rules

Optimal Platform includes pre-configured alerts:

Critical Alerts:

  • Service down (no healthy pods)
  • Database connection failure
  • High error rate (greater than 5%)
  • Certificate expiring (less than 7 days)

Warning Alerts:

  • High CPU usage (greater than 80%)
  • High memory usage (greater than 80%)
  • Disk space low (less than 20% free)
  • High request latency (p99 greater than 1s)
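
Two of the thresholds above can be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not the platform's shipped rule set: it assumes http_requests_total carries a status label with the HTTP status code, that http_request_duration_seconds is a histogram exposing _bucket series, and the for durations and severity labels are illustrative choices.

```yaml
groups:
  - name: optimal-platform-alerts-sketch   # hypothetical group name
    rules:
      # Critical: more than 5% of requests return a 5xx status
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m                             # illustrative hold time before firing
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"

      # Warning: p99 request latency above 1 second
      - alert: HighRequestLatency
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m                            # illustrative hold time before firing
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s for {{ $labels.job }}"
```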

Alert Channels

Configure alert destinations in Grafana:

  • Email
  • Slack
  • PagerDuty
  • Webhook
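
Besides the Grafana UI, alert destinations can be declared as contact points in a Grafana alerting provisioning file. The fragment below is a sketch with placeholder names, UIDs, and URLs; none of these values come from the platform, and whether the platform mounts such a file is an assumption.

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: platform-oncall                 # hypothetical contact point name
    receivers:
      - uid: platform-oncall-slack        # placeholder UID
        type: slack
        settings:
          url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK   # placeholder webhook URL

      - uid: platform-oncall-webhook      # placeholder UID
        type: webhook
        settings:
          url: https://example.com/alerts # placeholder endpoint
          httpMethod: POST
```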

Accessing Observability

Grafana Dashboard

# Port forward for local access
kubectl port-forward svc/grafana 3001:80 -n monitoring

# Access at http://localhost:3001

Prometheus UI

# Port forward for local access
kubectl port-forward svc/prometheus 9090:9090 -n monitoring

# Access at http://localhost:9090

Querying Logs

Query logs with Grafana Explore or LogCLI:

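# Install LogCLI
brew install grafana/tap/logcli

# LogCLI reads the Loki address from LOKI_ADDR (default: http://localhost:3100)
export LOKI_ADDR=http://localhost:3100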

# Query logs
logcli query '{namespace="optimal-system"}'

# Filter by service
logcli query '{namespace="optimal-system", app="api-gateway"}'

# Search for errors
logcli query '{namespace="optimal-system"} |= "error"'

Resource Requirements

Component     CPU Request    Memory Request    Storage
Prometheus    500m           1Gi               50Gi
Loki          250m           512Mi             10Gi
Promtail      100m           128Mi             -
Grafana       250m           256Mi             1Gi
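
Expressed as Kubernetes resource requests, the figures above would look roughly like this. The resources blocks use the standard Kubernetes container format, but their placement under each component key, and the persistence keys for Prometheus and Grafana, are assumptions about the values layout rather than documented settings.

```yaml
prometheus:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
  persistence:
    size: 50Gi        # assumed key, mirroring loki.persistence.size

loki:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
  persistence:
    size: 10Gi

promtail:
  resources:          # no persistent storage listed for Promtail
    requests:
      cpu: 100m
      memory: 128Mi

grafana:
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
  persistence:
    size: 1Gi         # assumed key, mirroring loki.persistence.size
```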