Monitoring Guide¶
This guide covers monitoring, metrics, and health checks for CodeGraph in production environments.
Table of Contents¶
- Overview
- Quick Start
    - Enable Metrics
    - Health Check Server
- Prometheus Metrics
    - Available Metrics
    - Histogram Buckets
    - Recording Metrics
- Decorators
    - @track_execution
    - @track_agent
    - @track_scenario
- Health Checks
    - Health Status
    - Component Health
    - Custom Health Checks
    - Health Endpoints
- Structured Logging
    - Setup
    - Log Format
    - Context Logging
- Grafana Dashboards
    - Recommended Panels
    - Alert Rules
- Kubernetes Integration
    - Deployment Configuration
    - ServiceMonitor for Prometheus
- Best Practices
- See Also
Overview¶
The monitoring module provides:
- Prometheus metrics for observability
- Structured JSON logging for analysis
- Health check endpoints for orchestration
- Monitoring decorators for instrumentation
Quick Start¶
Enable Metrics¶
from src.monitoring import (
    MetricsCollector,
    track_execution,
    track_agent,
    setup_structured_logging
)

# Setup structured logging
setup_structured_logging(log_level="INFO")

# Get metrics collector
metrics = MetricsCollector()

# Use decorators for automatic tracking
@track_execution("my_operation")
def process_query(query: str):
    # Your code here
    return result
Health Check Server¶
from src.monitoring.health import HealthCheckServer
# Start health check server (runs alongside main app)
health_server = HealthCheckServer(port=8081)
health_server.start()
# Endpoints available:
# GET /health - Overall health status
# GET /ready - Readiness probe
# GET /live - Liveness probe
# GET /metrics - Prometheus metrics
# GET /stats - System statistics
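Once started, the endpoints can be checked with any HTTP client. A minimal sketch using the Python standard library, assuming the server above is running locally on port 8081:

import json
import urllib.request

# Read the overall health report (assumes the health server above is running locally)
with urllib.request.urlopen("http://localhost:8081/health") as resp:
    health = json.load(resp)

print(health["status"])  # "healthy", "degraded" or "unhealthy"
for component in health.get("components", []):
    print(component["name"], component["status"], component.get("latency_ms"))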
Prometheus Metrics¶
Available Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `rag_scenario_duration_seconds` | Histogram | scenario_name | Scenario execution time |
| `rag_scenario_success_total` | Counter | scenario_name | Successful executions |
| `rag_scenario_failure_total` | Counter | scenario_name, error_type | Failed executions |
| `rag_agent_duration_seconds` | Histogram | agent_name, scenario | Agent execution time |
| `rag_agent_success_total` | Counter | agent_name, scenario | Agent successes |
| `rag_agent_failure_total` | Counter | agent_name, scenario, error_type | Agent failures |
| `rag_cache_hits_total` | Counter | cache_name | Cache hits |
| `rag_cache_misses_total` | Counter | cache_name | Cache misses |
| `rag_llm_requests_total` | Counter | model, status | LLM API requests |
| `rag_llm_duration_seconds` | Histogram | model | LLM latency |
| `rag_llm_tokens_total` | Counter | model, type | Tokens used |
| `rag_query_duration_seconds` | Histogram | query_type | Database query time |
| `rag_active_connections` | Gauge | - | Active connections |
Histogram Buckets¶
# Scenario duration buckets (seconds)
[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0]
# Agent duration buckets (seconds)
[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
# Query duration buckets (milliseconds)
[1, 5, 10, 25, 50, 100, 250, 500, 1000]
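These bucket lists are presumably passed to the histogram constructors when the metrics are declared. A minimal sketch with prometheus_client, reusing the metric name and label from the table above (the actual declarations in src.monitoring.metrics may differ):

from prometheus_client import Histogram

# Illustrative declaration only -- the real definitions live in src.monitoring.metrics
SCENARIO_DURATION = Histogram(
    "rag_scenario_duration_seconds",
    "Scenario execution time",
    labelnames=["scenario_name"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0],
)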
Recording Metrics¶
from src.monitoring.metrics import (
    SCENARIO_DURATION,
    SCENARIO_SUCCESS,
    SCENARIO_FAILURE,
    LLM_REQUESTS,
    LLM_DURATION,
    LLM_TOKENS
)

# Record scenario execution
with SCENARIO_DURATION.labels(scenario_name="security_audit").time():
    result = run_scenario()
SCENARIO_SUCCESS.labels(scenario_name="security_audit").inc()

# Record LLM usage
LLM_REQUESTS.labels(model="GigaChat-2-Pro", status="success").inc()
LLM_DURATION.labels(model="GigaChat-2-Pro").observe(2.5)
LLM_TOKENS.labels(model="GigaChat-2-Pro", type="completion").inc(150)
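Since rag_scenario_failure_total carries an error_type label, failures are naturally recorded from an exception handler. A sketch of the full success/failure path, using the exception class name as the error type (an assumption about the label's convention):

# Record success or failure depending on the outcome
try:
    with SCENARIO_DURATION.labels(scenario_name="security_audit").time():
        result = run_scenario()
    SCENARIO_SUCCESS.labels(scenario_name="security_audit").inc()
except Exception as exc:
    SCENARIO_FAILURE.labels(
        scenario_name="security_audit",
        error_type=type(exc).__name__,  # assumed convention for the error_type label
    ).inc()
    raise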
Decorators¶
@track_execution¶
Automatically tracks function execution time and success/failure.
from src.monitoring import track_execution

@track_execution("data_processing")
def process_data(data: list) -> dict:
    # Automatically records:
    # - Duration to rag_operation_duration_seconds
    # - Success/failure to rag_operation_total
    return processed_data

@track_execution("api_call", log_args=True)
def call_api(url: str, params: dict) -> dict:
    # Also logs function arguments
    return response
@track_agent¶
Track agent execution with scenario context.
from src.monitoring import track_agent

@track_agent("analyzer")
def run_analyzer(question: str, scenario: str):
    # Records to rag_agent_* metrics
    return analysis
@track_scenario¶
Track complete scenario execution.
from src.monitoring import track_scenario

@track_scenario("security_audit")
def run_security_audit(codebase: str):
    # Records to rag_scenario_* metrics
    return findings
Health Checks¶
Health Status¶
class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
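The overall status returned by /health is derived from the component statuses. A plausible aggregation rule, shown here as a sketch rather than the module's exact logic: the worst component status wins.

def aggregate_status(components: list) -> HealthStatus:
    # Assumed aggregation rule -- any unhealthy component makes the system unhealthy,
    # any degraded component makes it degraded, otherwise it is healthy
    statuses = {c.status for c in components}
    if HealthStatus.UNHEALTHY in statuses:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in statuses:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY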
Component Health¶
The system checks these components:
| Component | Check | Threshold |
|---|---|---|
| Database | Query latency | < 100ms |
| LLM Provider | API response | < 5s |
| Vector Store | Query latency | < 200ms |
| Cache | Read/write | < 50ms |
| Joern | Connection | < 1s |
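Each check returns a ComponentHealth record. Its approximate shape, inferred from how it is constructed in the example below (field defaults are assumptions):

from dataclasses import dataclass

from src.monitoring.health import HealthStatus

@dataclass
class ComponentHealth:
    # Approximate shape inferred from usage; the real class lives in src.monitoring.health
    name: str
    status: HealthStatus
    latency_ms: float = 0.0
    message: str = ""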
Custom Health Checks¶
from src.monitoring.health import HealthChecker, ComponentHealth, HealthStatus

checker = HealthChecker()

# Add custom component check
def check_custom_service():
    try:
        response_time = ping_service()
        if response_time < 100:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.HEALTHY,
                latency_ms=response_time
            )
        else:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.DEGRADED,
                latency_ms=response_time,
                message="High latency"
            )
    except Exception as e:
        return ComponentHealth(
            name="custom_service",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

checker.add_check("custom_service", check_custom_service)
Health Endpoints¶
GET /health
{
  "status": "healthy",
  "components": [
    {
      "name": "database",
      "status": "healthy",
      "latency_ms": 2.5,
      "message": ""
    },
    {
      "name": "llm_provider",
      "status": "healthy",
      "latency_ms": 450.0,
      "message": ""
    }
  ],
  "timestamp": 1702580000.0,
  "uptime_seconds": 3600.0,
  "version": "2.0.0"
}
GET /ready
{
  "ready": true,
  "checks_passed": 4,
  "checks_total": 4
}
GET /live
{
  "alive": true,
  "uptime_seconds": 3600.0
}
Structured Logging¶
Setup¶
from src.monitoring import setup_structured_logging
# Configure structured logging
setup_structured_logging(
    log_level="INFO",
    json_format=True,
    include_timestamp=True,
    include_caller=True
)
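After setup, obtain a logger with get_logger (shown in Context Logging below) and log as usual; each call is emitted as one JSON record in the format shown next. A brief sketch:

from src.monitoring import get_logger

logger = get_logger(__name__)
logger.info("Query processed", extra={"duration_ms": 245})  # emitted as a single JSON record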
Log Format¶
{
  "timestamp": "2025-12-14T10:30:00.000Z",
  "level": "INFO",
  "logger": "src.agents.analyzer",
  "message": "Query processed",
  "context": {
    "scenario": "security_audit",
    "query_id": "abc123",
    "duration_ms": 245
  },
  "caller": {
    "file": "analyzer.py",
    "line": 45,
    "function": "process_query"
  }
}
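Because each record is a single JSON object, logs can be filtered programmatically without a log aggregator. A small sketch that scans a JSON-lines file for slow queries (the app.log path and the 1000 ms threshold are illustrative):

import json

# Print messages of slow operations (> 1000 ms) from a JSON-lines log file
with open("app.log", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record.get("context", {}).get("duration_ms", 0) > 1000:
            print(record["timestamp"], record["message"])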
Context Logging¶
from src.monitoring import log_context, get_logger

logger = get_logger(__name__)

# Add context to all logs in scope
with log_context(request_id="abc123", user_id="user1"):
    logger.info("Processing request")  # Includes request_id and user_id
    result = process()
    logger.info("Request completed", extra={"result_count": len(result)})
Grafana Dashboards¶
Recommended Panels¶
Request Rate:
rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m])
Error Rate:
rate(rag_scenario_failure_total[5m]) / (rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m]))
P95 Latency:
histogram_quantile(0.95, rate(rag_scenario_duration_seconds_bucket[5m]))
LLM Token Usage:
sum(rate(rag_llm_tokens_total[1h])) by (model)
Cache Hit Rate:
rate(rag_cache_hits_total[5m]) / (rate(rag_cache_hits_total[5m]) + rate(rag_cache_misses_total[5m]))
Alert Rules¶
groups:
  - name: codegraph
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(rag_scenario_failure_total[5m]))
          / sum(rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m]))
          > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(rag_scenario_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 30s"

      - alert: ComponentUnhealthy
        expr: rag_component_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Component {{ $labels.component }} is unhealthy"
Kubernetes Integration¶
Deployment Configuration¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codegraph
spec:
  template:
    spec:
      containers:
        - name: codegraph
          ports:
            - containerPort: 8000
              name: api
            - containerPort: 8081
              name: health
          livenessProbe:
            httpGet:
              path: /live
              port: health
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: health
            initialDelaySeconds: 10
            periodSeconds: 5
ServiceMonitor for Prometheus¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: codegraph
spec:
  selector:
    matchLabels:
      app: codegraph
  endpoints:
    - port: health
      path: /metrics
      interval: 30s
Best Practices¶
- Use decorators for automatic instrumentation
- Add context to logs for traceability (see the sketch after this list)
- Set appropriate thresholds for alerts
- Monitor token usage to control costs
- Check P95/P99 latencies, not just averages
- Export metrics to Prometheus for persistence
- Use structured logging for log aggregation
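The first two practices combine naturally: wrap an operation with a tracking decorator and run it inside a log context so metrics and log records share the same identifiers. A sketch using only the APIs shown earlier in this guide (handle_request and run_scenario are hypothetical):

from src.monitoring import track_execution, log_context, get_logger

logger = get_logger(__name__)

@track_execution("handle_request")
def handle_request(request_id: str, query: str):
    # Hypothetical handler: duration and success/failure are recorded by the decorator,
    # while log_context stamps every log record with the same request_id
    with log_context(request_id=request_id):
        logger.info("Processing request")
        return run_scenario(query)  # hypothetical downstream call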
See Also¶
- REST API Reference - API endpoints
- Prometheus Documentation
- Grafana Documentation