Monitoring and Observability¶
Set up comprehensive monitoring, logging, and tracing for USL applications.
Overview¶
USL generates monitoring configurations for:
- Metrics: Prometheus scraping and Grafana dashboards
- Logs: Structured logging with centralized aggregation
- Traces: Distributed tracing with OpenTelemetry
- Alerts: Proactive incident detection
Enable Monitoring¶
Prometheus Setup¶
Install Prometheus Stack¶
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
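Once the chart is installed, verify the stack's pods are running before wiring up your application (a quick sanity check):
kubectl get pods -n monitoring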
ServiceMonitor¶
Generated automatically:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: my-app
labels:
release: prometheus
spec:
selector:
matchLabels:
app: my-app
endpoints:
- port: metrics
interval: 30s
path: /metrics
Application Metrics¶
Implement a metrics endpoint in your application:
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI()

request_count = Counter('http_requests_total', 'Total HTTP requests')
request_duration = Histogram('http_request_duration_seconds', 'HTTP request latency')

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
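The counters above still need to be updated on each request; one way to do that is an HTTP middleware (a minimal sketch using FastAPI's middleware hook, matching the label-free metrics defined above):
import time

from fastapi import Request

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    request_count.inc()                                    # count every request
    request_duration.observe(time.perf_counter() - start)  # record latency in seconds
    return response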
Key Metrics¶
# Request rate
http_requests_total
# Error rate
http_requests_errors_total
# Latency
http_request_duration_seconds
# Resource usage
container_cpu_usage_seconds_total
container_memory_working_set_bytes
Grafana Dashboards¶
Access Grafana¶
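Port-forward the Grafana service to reach the UI (assuming the kube-prometheus-stack release name prometheus and the monitoring namespace used above):
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
Then open http://localhost:3000.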
Default credentials: admin/prom-operator
USL Application Dashboard¶
Generated dashboard includes:
- Request rate (QPS)
- Error rate (4xx, 5xx)
- P50, P95, P99 latency
- CPU and memory usage
- Pod status and restarts
Import Dashboard¶
Custom Queries¶
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Logging¶
Structured Logging¶
import structlog
log = structlog.get_logger()
log.info("user_login", user_id=user.id, ip_address=request.client.host)
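To emit JSON that log aggregators can parse, configure structlog with a JSON renderer (a minimal sketch; the processor list is an assumption):
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),   # ISO-8601 timestamps
        structlog.processors.add_log_level,            # include the log level
        structlog.processors.JSONRenderer(),           # one JSON object per line
    ]
)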
Log Aggregation¶
ELK Stack¶
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch -n logging --create-namespace
helm install kibana elastic/kibana -n logging
helm install filebeat elastic/filebeat -n logging
Loki (Lightweight)¶
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace \
--set promtail.enabled=true
Query Logs¶
Loki is queried with LogQL, typically from Grafana's Explore view.
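For example, selecting the app's logs and filtering for errors (assuming the app=my-app label used in the manifests above):
{app="my-app"} |= "error"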
Distributed Tracing¶
OpenTelemetry Setup¶
Install Jaeger¶
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.49.0/jaeger-operator.yaml -n observability
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: simplest
namespace: observability
EOF
Application Integration¶
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

# Send spans to a Jaeger agent (hostname must match your deployment)
trace.set_tracer_provider(TracerProvider())
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

@app.get("/users/{user_id}")
async def get_user(user_id: str):
    with tracer.start_as_current_span("get_user"):
        # Your code here
        pass
Access Jaeger UI¶
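Port-forward the query service created by the operator (assuming the simplest instance above, which exposes a simplest-query service):
kubectl port-forward svc/simplest-query -n observability 16686:16686
Then open http://localhost:16686.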
Alerts¶
PrometheusRule¶
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: my-app-alerts
namespace: production
spec:
groups:
- name: my-app
interval: 30s
rules:
- alert: HighErrorRate
expr: rate(http_requests_errors_total[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value }} (threshold: 0.05)"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s"
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod is crash looping"
Alert Notification¶
Slack¶
global:
slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
route:
receiver: 'slack'
group_by: ['alertname']
group_wait: 10s
repeat_interval: 12h
receivers:
- name: 'slack'
slack_configs:
- channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
PagerDuty¶
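A PagerDuty receiver follows the same Alertmanager pattern (a sketch; replace the placeholder with an Events API v2 integration key):
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
Point the top-level route (or a matching sub-route) at the pagerduty receiver to deliver pages.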
Health Checks¶
Liveness Probe¶
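A liveness endpoint can stay trivial so the kubelet restarts the pod only when the process itself is unresponsive (a sketch; the path is an assumption and should match the generated probe):
@app.get("/health/live")
async def liveness():
    return {"status": "alive"}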
Readiness Probe¶
@app.get("/health/ready")
async def readiness():
# Check database connection
try:
await database.execute("SELECT 1")
return {"status": "ready"}
except:
raise HTTPException(status_code=503, detail="Not ready")
Performance Monitoring¶
Application Performance Monitoring (APM)¶
Datadog¶
env:
- name: DD_AGENT_HOST
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: DD_SERVICE
value: "my-app"
- name: DD_ENV
value: "production"
New Relic¶
env:
- name: NEW_RELIC_LICENSE_KEY
valueFrom:
secretKeyRef:
name: newrelic
key: license-key
- name: NEW_RELIC_APP_NAME
value: "my-app"
Dashboards¶
SLO Dashboard¶
Track Service Level Objectives:
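For example, an availability SLI and the remaining error budget over a 30-day window, built from the request counters above (the 99.9% target is an assumption):
# Availability SLI (fraction of successful requests, 30 days)
1 - (sum(rate(http_requests_errors_total[30d])) / sum(rate(http_requests_total[30d])))

# Remaining error budget for a 99.9% objective
1 - ((sum(rate(http_requests_errors_total[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999))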
Business Metrics¶
# Active users
sum(rate(user_login_total[5m]))
# Revenue per minute
sum(rate(transaction_amount_total[5m]))
# Conversion rate
rate(signup_success_total[5m]) / rate(signup_attempt_total[5m])
Cost Monitoring¶
Kubecost¶
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace
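Port-forward the cost-analyzer UI (assuming the default service name created by the chart):
kubectl port-forward svc/kubecost-cost-analyzer -n kubecost 9090:9090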
Access: http://localhost:9090
Incident Response¶
Runbooks¶
Create runbooks for common alerts:
## High Error Rate Alert
1. Check recent deployments
2. Review application logs
3. Check database connectivity
4. Verify external API status
5. Rollback if necessary
On-Call Rotation¶
Use PagerDuty schedules:
schedules:
- name: Primary
timezone: America/Los_Angeles
layers:
- users: [alice, bob]
rotation_turnover: daily
start: "2024-01-01T09:00:00"
Compliance and Audit¶
Audit Logging¶
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
resources:
- group: ""
resources: ["secrets"]
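The policy only takes effect once the API server is pointed at it; on a self-managed control plane that means kube-apiserver flags (a sketch with assumed file paths; managed clusters expose audit logging through provider settings instead):
kube-apiserver \
  --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
  --audit-log-path=/var/log/kubernetes/audit.log \
  --audit-log-maxage=30 \
  --audit-log-maxbackup=10 \
  --audit-log-maxsize=100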
Log Retention¶
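If you ship logs to the Loki stack installed above, retention can be bounded in Loki's configuration (a sketch; the 31-day value and compactor settings are assumptions):
limits_config:
  retention_period: 744h   # roughly 31 days
compactor:
  retention_enabled: true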
Next Steps¶
- Kubernetes Guide - K8s deployment
- Helm Guide - Helm charts
- Secrets - Secret management
- Cloud Providers - Cloud platforms