This guide covers monitoring, metrics, and health checks for CodeGraph in production environments.
Table of Contents¶
- Overview
- Quick Start
  - Enable Metrics
  - Health Check Application
- Prometheus Metrics
  - Available Metrics
  - Histogram Buckets
- Recording Metrics
- Recording Functions
- Decorators
  - @monitor_scenario
  - @monitor_agent
  - @monitor_cache
- MetricsCollector
- MetricsSummary
- Health Checks
  - Health Status
  - Component Health
  - Custom Health Checks
  - Health Endpoints — Standalone App
  - Health Endpoints — API Router
  - Standalone Functions
- Structured Logging
  - StructuredLogger
  - Log Format
  - Timed Operations
- MetricsMiddleware
- Grafana Dashboards
  - Recommended Panels
- Alert Rules
- Alertmanager
- Yandex Cloud Dashboard
- Kubernetes Integration
  - Deployment Configuration
  - ServiceMonitor for Prometheus
- Best Practices
- CISO/CTO Grafana Dashboard
- Dashboard Alert Rules
- Dashboard V2 Metrics
- See Also
Overview¶
The monitoring module (src/monitoring/) provides:
- Prometheus metrics — 18 metrics for observability
- Structured JSON logging — StructuredLogger with timed_operation() context manager
- Health check endpoints — standalone FastAPI app and API router
- Monitoring decorators — monitor_scenario, monitor_agent, monitor_cache
- Recording functions — record_llm_call, record_cpg_query, record_retrieval
- MetricsCollector — singleton for in-memory metric aggregation
- MetricsMiddleware — automatic HTTP request instrumentation
Quick Start¶
Enable Metrics¶
from src.monitoring import (
    MetricsCollector,
    StructuredLogger,
    monitor_scenario,
    monitor_agent,
    monitor_cache,
    record_llm_call,
)

# Create structured logger
logger = StructuredLogger("my_component")

# Get metrics collector (singleton)
metrics = MetricsCollector()

# Use decorators for automatic tracking
@monitor_scenario("my_scenario")
def process_query(query: str):
    # Automatically records SCENARIO_DURATION, SCENARIO_SUCCESS/FAILURE,
    # ACTIVE_REQUESTS, TOTAL_REQUESTS
    return result
Health Check Application¶
from src.monitoring.health import create_health_app
# Create standalone health check FastAPI app
app = create_health_app()
# Endpoints available:
# GET /health - Overall health status
# GET /health/live - Liveness probe
# GET /health/ready - Readiness probe
# GET /metrics - Prometheus metrics
# GET /stats - System statistics
Prometheus Metrics¶
Available Metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| rag_scenario_duration_seconds | Histogram | scenario_name | Scenario execution time |
| rag_scenario_success_total | Counter | scenario_name | Successful executions |
| rag_scenario_failure_total | Counter | scenario_name, error_type | Failed executions |
| rag_agent_duration_seconds | Histogram | agent_name, scenario | Agent execution time |
| rag_agent_success_total | Counter | agent_name, scenario | Agent successes |
| rag_agent_failure_total | Counter | agent_name, scenario, error_type | Agent failures |
| rag_cache_hits_total | Counter | cache_type | Cache hits |
| rag_cache_misses_total | Counter | cache_type | Cache misses |
| rag_cache_size | Gauge | cache_type | Current cache size |
| rag_active_requests | Gauge | — | Active requests being processed |
| rag_total_requests | Counter | — | Total requests processed |
| rag_llm_latency_seconds | Histogram | model, operation | LLM API call latency |
| rag_llm_tokens_total | Counter | model, token_type | LLM tokens used |
| rag_llm_errors_total | Counter | model, error_type | LLM API errors |
| rag_cpg_query_latency_seconds | Histogram | query_type | CPG query latency |
| rag_cpg_query_results | Histogram | query_type | CPG query result count |
| rag_retrieval_latency_seconds | Histogram | retrieval_type | Retrieval operation latency |
| rag_retrieval_results_count | Histogram | retrieval_type | Retrieval result count |
Total: 18 metrics (all defined in src/monitoring/metrics.py). ACTIVE_REQUESTS and TOTAL_REQUESTS have no labels.
Histogram Buckets¶
# SCENARIO_DURATION buckets (seconds)
(0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0)
# AGENT_DURATION buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)
# LLM_LATENCY buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0)
# CPG_QUERY_LATENCY buckets (seconds)
(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0)
# CPG_QUERY_RESULTS buckets (count)
(0, 1, 5, 10, 50, 100, 500, 1000)
# RETRIEVAL_LATENCY buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 5.0)
# RETRIEVAL_RESULTS buckets (count)
(0, 1, 5, 10, 20, 50)
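Prometheus histograms are cumulative: an observation increments every bucket whose upper bound is greater than or equal to the value, plus the implicit +Inf bucket. A stdlib sketch of that mapping, using the RETRIEVAL_LATENCY buckets above (the function is illustrative, not part of CodeGraph):

```python
RETRIEVAL_BUCKETS = (0.1, 0.5, 1.0, 2.0, 5.0)

def bucket_counts(observations, bounds=RETRIEVAL_BUCKETS):
    """Return cumulative bucket counts as a Prometheus histogram exposes them."""
    counts = {bound: 0 for bound in bounds}
    counts[float("inf")] = 0
    for value in observations:
        for bound in bounds:
            if value <= bound:
                counts[bound] += 1  # cumulative: every bucket >= value
        counts[float("inf")] += 1   # +Inf counts every observation
    return counts

counts = bucket_counts([0.05, 1.2, 7.0])
print(counts[0.1])           # 1 (only 0.05)
print(counts[2.0])           # 2 (0.05 and 1.2; 7.0 exceeds every finite bound)
print(counts[float("inf")])  # 3
```

This cumulative layout is what lets histogram_quantile() estimate percentiles from the per-le _bucket series.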
Recording Metrics¶
from src.monitoring.metrics import (
    SCENARIO_DURATION,
    SCENARIO_SUCCESS,
    SCENARIO_FAILURE,
    LLM_LATENCY,
    LLM_TOKENS,
    LLM_ERRORS,
    CACHE_HITS,
    CACHE_MISSES,
)

# Record scenario execution
with SCENARIO_DURATION.labels(scenario_name="security_audit").time():
    result = run_scenario()
SCENARIO_SUCCESS.labels(scenario_name="security_audit").inc()

# Record LLM usage
LLM_LATENCY.labels(model="GigaChat-2-Pro", operation="generate").observe(2.5)
LLM_TOKENS.labels(model="GigaChat-2-Pro", token_type="output").inc(150)
LLM_ERRORS.labels(model="GigaChat-2-Pro", error_type="timeout").inc()

# Record cache access
CACHE_HITS.labels(cache_type="query_plan").inc()
CACHE_MISSES.labels(cache_type="embedding").inc()
Recording Functions¶
Helper functions for recording metrics with multiple counters at once:
from src.monitoring import record_llm_call, record_cpg_query, record_retrieval
# Record LLM call (updates LLM_LATENCY, LLM_TOKENS, LLM_ERRORS)
record_llm_call(
    model="GigaChat-2-Pro",
    operation="generate",
    duration=2.5,
    input_tokens=100,
    output_tokens=150,
    error=None,  # or error type string on failure
)

# Record CPG query (updates CPG_QUERY_LATENCY, CPG_QUERY_RESULTS)
record_cpg_query(
    query_type="method_lookup",
    duration=0.05,
    result_count=12,
)

# Record retrieval operation (updates RETRIEVAL_LATENCY, RETRIEVAL_RESULTS)
record_retrieval(
    retrieval_type="hybrid",
    duration=1.2,
    result_count=10,
)
Decorators¶
@monitor_scenario¶
Automatically tracks scenario execution: duration, success/failure, active requests.
from src.monitoring import monitor_scenario
@monitor_scenario("security_audit")
def run_security_audit(codebase: str):
    # Records:
    # - SCENARIO_DURATION.labels(scenario_name="security_audit")
    # - SCENARIO_SUCCESS / SCENARIO_FAILURE
    # - ACTIVE_REQUESTS inc/dec
    # - TOTAL_REQUESTS inc
    return findings
Parameters: scenario_name: str — name of the scenario.
@monitor_agent¶
Tracks agent execution with scenario context.
from src.monitoring import monitor_agent
@monitor_agent("retriever_agent", scenario="security")
def retrieve_context(query: str) -> list:
    # Records:
    # - AGENT_DURATION.labels(agent_name="retriever_agent", scenario="security")
    # - AGENT_SUCCESS / AGENT_FAILURE
    return results
Parameters: agent_name: str, scenario: str = "unknown".
@monitor_cache¶
Tracks cache hit/miss based on return value (None = miss, otherwise = hit).
from src.monitoring import monitor_cache
@monitor_cache("query_plan")
def get_cached_query(key: str) -> dict | None:
    # Records:
    # - CACHE_HITS.labels(cache_type="query_plan") on non-None return
    # - CACHE_MISSES.labels(cache_type="query_plan") on None return
    return cached_result
Parameters: cache_type: str = "query_plan".
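The None-means-miss convention is easy to replicate for custom caches. A stdlib sketch of a decorator with the same contract (the plain dicts here are stand-ins for the Prometheus counters, and all names are illustrative):

```python
import functools

CACHE_HITS: dict[str, int] = {}
CACHE_MISSES: dict[str, int] = {}

def monitor_cache_sketch(cache_type: str = "query_plan"):
    """Count a hit when the wrapped function returns non-None, else a miss."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            bucket = CACHE_HITS if result is not None else CACHE_MISSES
            bucket[cache_type] = bucket.get(cache_type, 0) + 1
            return result
        return wrapper
    return decorator

_store = {"k1": {"plan": "cached"}}

@monitor_cache_sketch("query_plan")
def get_cached_query(key: str):
    return _store.get(key)  # None on an absent key, which counts as a miss

get_cached_query("k1")      # hit
get_cached_query("absent")  # miss
print(CACHE_HITS, CACHE_MISSES)  # {'query_plan': 1} {'query_plan': 1}
```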
MetricsCollector¶
Singleton for in-memory metric aggregation. Useful when Prometheus server is not available.
from src.monitoring import MetricsCollector, get_metrics_collector
# Both return the same singleton instance
collector = MetricsCollector()
collector = get_metrics_collector()
# Record metrics
collector.record_latency("query_generation", 0.245)
collector.record_scenario("security_audit", success=True, latency=1.5)
collector.record_cache_access("query_plan", hit=True)
collector.increment_counter("total_queries", amount=1)
collector.set_gauge("active_requests", 5.0)
# Get summary
summary = collector.get_summary() # Returns MetricsSummary
stats = collector.get_scenario_stats() # Per-scenario dict
cache = collector.get_cache_stats() # Per-cache-type dict
uptime = collector.get_uptime_seconds()
# Reset all collected metrics
collector.reset()
MetricsSummary¶
Dataclass returned by MetricsCollector.get_summary():
| Field | Type | Description |
|---|---|---|
| total_requests | int | Total scenario executions |
| successful_requests | int | Successful executions |
| failed_requests | int | Failed executions |
| success_rate | float | Success ratio (0.0–1.0) |
| avg_latency_ms | float | Average latency in ms |
| p50_latency_ms | float | 50th percentile latency |
| p95_latency_ms | float | 95th percentile latency |
| p99_latency_ms | float | 99th percentile latency |
| cache_hit_rate | float | Cache hit ratio (0.0–1.0) |
| active_requests | int | Currently active requests |
summary = collector.get_summary()
data = summary.to_dict() # Convert to dict for JSON serialization
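The percentile fields can be derived from raw latency samples with nothing more than sorting. A nearest-rank sketch of the kind of computation an in-memory collector might use (the actual MetricsCollector internals may differ):

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over raw samples (pct in 0..100)."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    # index of the sample at or above the requested rank, clamped to bounds
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [120.0, 80.0, 240.0, 95.0, 400.0, 110.0, 105.0, 90.0, 100.0, 130.0]
print(percentile(latencies_ms, 50))  # 105.0
print(percentile(latencies_ms, 95))  # 400.0
```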
Health Checks¶
Health Status¶
from src.monitoring.health import HealthStatus
class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
Component Health¶
The system checks a default set of components; there are two layers.

HealthChecker (standalone) — checks registered via _register_default_checks():
| Component | Check Method | Description |
|---|---|---|
| database | _check_database | DuckDB CPG database connectivity |
| llm | _check_llm | LLM provider response |
| llm_providers | _check_llm_providers | Available LLM provider modules |
| quality | _check_quality | RAG response quality feedback loop |
API Router (src/api/routers/health.py):
| Component | Check Function | Description |
|---|---|---|
| database | DatabaseHealthCheck.check() | PostgreSQL database |
| llm | check_llm_health() | LLM provider config |
| chromadb | check_chromadb_health() | ChromaDB vector store |
| cpg | check_cpg_health() | DuckDB CPG database |
Custom Health Checks¶
from src.monitoring.health import HealthChecker, ComponentHealth, HealthStatus
checker = HealthChecker()

def check_custom_service():
    try:
        response_time = ping_service()
        if response_time < 100:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.HEALTHY,
                latency_ms=response_time,
            )
        else:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.DEGRADED,
                latency_ms=response_time,
                message="High latency",
            )
    except Exception as e:
        return ComponentHealth(
            name="custom_service",
            status=HealthStatus.UNHEALTHY,
            message=str(e),
        )

checker.register_check("custom_service", check_custom_service)
Health Endpoints — Standalone App¶
Created by create_health_app() from src/monitoring/health:
| Endpoint | Description |
|---|---|
| GET /health | Overall health status |
| GET /health/live | Liveness probe |
| GET /health/ready | Readiness probe |
| GET /metrics | Prometheus metrics |
| GET /stats | System statistics |
GET /health response:
{
  "status": "healthy",
  "timestamp": 1702580000.0,
  "uptime_seconds": 3600.0,
  "version": "2.0.0",
  "components": [
    {
      "name": "database",
      "status": "healthy",
      "latency_ms": 2.5,
      "message": "Connected, 1500 methods",
      "details": {}
    },
    {
      "name": "llm",
      "status": "healthy",
      "latency_ms": 450.0,
      "message": "LLM responding normally",
      "details": {"provider": "yandex"}
    }
  ]
}
GET /health/live response:
{"status": "alive"}
GET /health/ready response:
{"status": "ready"}
GET /stats response:
{
  "summary": {"total_requests": 100, "success_rate": 0.95, "...": "..."},
  "scenarios": {"security_audit": {"total_requests": 50, "success_rate": 0.98}},
  "cache": {"query_plan": {"hits": 80, "misses": 20, "hit_rate": 0.8}},
  "uptime_seconds": 3600.0
}
Health Endpoints — API Router¶
Mounted at /api/v1/health in the main FastAPI application:
| Endpoint | Description |
|---|---|
| GET /api/v1/health | Full health check (all components as dict) |
| GET /api/v1/health/live | Liveness probe |
| GET /api/v1/health/ready | Readiness probe (checks DB) |
| GET /api/v1/health/version | API version |
| GET /api/v1/metrics | Prometheus metrics (separate router) |
GET /api/v1/health response:
{
  "status": "healthy",
  "version": "2.0.0",
  "uptime_seconds": 3600.0,
  "timestamp": "2026-03-07T10:30:00Z",
  "components": {
    "database": {"status": "healthy"},
    "llm": {"status": "healthy", "provider": "yandex"},
    "chromadb": {"status": "healthy", "collections": 5},
    "cpg": {"status": "healthy", "db_path": "/data/project.duckdb", "methods": 1500}
  }
}
Note: In the API router, components is a dict (keyed by component name), not an array.
Standalone Functions¶
Quick-check functions for use outside the health check framework:
from src.monitoring.health import (
    check_database_connection,
    check_llm_availability,
    check_vector_store,
    run_health_check_cli,
)
# Quick boolean checks
db_ok = check_database_connection() # True if DuckDB responds to SELECT 1
llm_ok = check_llm_availability() # True if LLM is_available()
vs_ok = check_vector_store() # True if ChromaDB heartbeat succeeds
# CLI health check (prints report, exits with code 0/1/2)
run_health_check_cli()
Structured Logging¶
StructuredLogger¶
JSON-formatted logger for production environments. Outputs structured data suitable for Elasticsearch, Splunk, or CloudWatch.
from src.monitoring import StructuredLogger
logger = StructuredLogger("my_component")
# Log with structured data
logger.info("Processing request", request_id="abc123", user="test")
logger.warning("Slow query", duration_ms=500)
logger.error("Failed to connect", service="database", error="timeout")
Methods: debug(), info(), warning(), error(), critical() — all accept message: str and arbitrary **kwargs.
Log Format¶
{
  "timestamp": "2026-03-07T10:30:00.000Z",
  "level": "INFO",
  "logger": "my_component",
  "message": "Processing request",
  "request_id": "abc123",
  "user": "test"
}
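A logging.Formatter producing this shape can be sketched with the stdlib alone. The field names mirror the example above; the real StructuredLogger implementation may differ:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, merging any extra= fields."""
    # attribute names present on every LogRecord; everything else came from extra=
    RESERVED = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self.RESERVED and key != "message":
                payload[key] = value
        return json.dumps(payload)

logger = logging.getLogger("my_component")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Processing request", extra={"request_id": "abc123", "user": "test"})
```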
Timed Operations¶
Context manager for timing operations with automatic success/error logging:
from src.monitoring import StructuredLogger
logger = StructuredLogger("query_engine")
with logger.timed_operation("query_generation", scenario="security"):
    result = generate_query()

# On success, logs:
# {"message": "Operation completed: query_generation",
#  "operation": "query_generation", "duration_ms": 245.0,
#  "status": "success", "scenario": "security"}
# On exception, logs error with status="error", error_type, error_message,
# then re-raises the exception
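That success/error contract is a small amount of code; a stdlib sketch (the list of dicts stands in for the JSON log sink, and all names are illustrative):

```python
import contextlib
import time

@contextlib.contextmanager
def timed_operation_sketch(log: list, operation: str, **context):
    """Record duration with status=success, or status=error and re-raise."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        log.append({"operation": operation, "status": "error",
                    "error_type": type(exc).__name__,
                    "error_message": str(exc),
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    **context})
        raise  # propagate so callers still see the failure
    log.append({"operation": operation, "status": "success",
                "duration_ms": (time.perf_counter() - start) * 1000,
                **context})

events = []  # stand-in for the structured log sink
with timed_operation_sketch(events, "query_generation", scenario="security"):
    pass  # work goes here
print(events[0]["status"])  # success
```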
MetricsMiddleware¶
HTTP middleware that automatically records Prometheus metrics for every request. Defined in src/api/middleware/metrics.py.
from src.api.middleware.metrics import MetricsMiddleware
app.add_middleware(MetricsMiddleware)
Metrics recorded per request:
- ACTIVE_REQUESTS — incremented on entry, decremented on exit (gauge)
- TOTAL_REQUESTS — incremented on entry (counter)
- SCENARIO_DURATION — records request duration with scenario_name=f"http:{path}" label
Grafana Dashboards¶
Recommended Panels¶
Request Rate:
rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m])
Error Rate:
rate(rag_scenario_failure_total[5m]) / (rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m]))
P95 Latency:
histogram_quantile(0.95, rate(rag_scenario_duration_seconds_bucket[5m]))
LLM Token Usage:
sum(rate(rag_llm_tokens_total[1h])) by (model)
Cache Hit Rate:
rate(rag_cache_hits_total[5m]) / (rate(rag_cache_hits_total[5m]) + rate(rag_cache_misses_total[5m]))
LLM Error Rate:
sum(rate(rag_llm_errors_total[5m])) by (model, error_type)
Active Requests:
rag_active_requests
Alert Rules¶
The alert rules shipped in monitoring/rules/alerts.yml:
groups:
  - name: codegraph_availability
    rules:
      - alert: CodeGraphAPIDown
        expr: up{job="codegraph-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CodeGraph API is down"
      - alert: CodeGraphHighErrorRate
        expr: |
          (
            sum(rate(rag_scenario_failure_total[5m]))
            / (sum(rate(rag_scenario_success_total[5m])) + sum(rate(rag_scenario_failure_total[5m])))
          ) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High scenario error rate (>25%)"
  - name: codegraph_latency
    rules:
      - alert: CodeGraphHighLatency
        expr: histogram_quantile(0.95, sum(rate(rag_scenario_duration_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High scenario latency (p95 > 30s)"
      - alert: CodeGraphLLMSlowResponses
        expr: histogram_quantile(0.95, sum(rate(rag_llm_latency_seconds_bucket[5m])) by (le)) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM response latency high (p95 > 10s)"
  - name: codegraph_resources
    rules:
      - alert: CodeGraphHighActiveRequests
        expr: rag_active_requests > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of active requests (>50)"
      - alert: CodeGraphLLMErrors
        expr: sum(rate(rag_llm_errors_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM errors detected"
  - name: codegraph_infrastructure
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }} is down"
Alertmanager¶
Alertmanager configuration in monitoring/alertmanager.yml:
- Routing: groups alerts by alertname and severity. Critical alerts repeat every 1 hour, others every 4 hours.
- Receiver: default webhook via ${ALERTMANAGER_WEBHOOK_URL} with send_resolved: true. Supports Slack and email (commented out by default).
- Inhibit rules: critical alerts suppress warning alerts with the same alertname.
Yandex Cloud Dashboard¶
monitoring/yandex/dashboard.json provides a pre-built dashboard for Yandex Monitoring with CodeGraph metrics.
Kubernetes Integration¶
Deployment Configuration¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: codegraph
spec:
  template:
    spec:
      containers:
        - name: codegraph
          ports:
            - containerPort: 8000
              name: api
          livenessProbe:
            httpGet:
              path: /api/v1/health/live
              port: api
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/v1/health/ready
              port: api
            initialDelaySeconds: 10
            periodSeconds: 5
ServiceMonitor for Prometheus¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: codegraph
spec:
  selector:
    matchLabels:
      app: codegraph
  endpoints:
    - port: api
      path: /api/v1/metrics
      interval: 30s
Best Practices¶
- Use decorators (monitor_scenario, monitor_agent, monitor_cache) for automatic instrumentation
- Use StructuredLogger.timed_operation() to measure and log operation duration
- Use recording functions (record_llm_call, record_cpg_query, record_retrieval) instead of directly incrementing counters
- Monitor token usage with LLM_TOKENS to control costs
- Check P95/P99 latencies, not just averages
- Set alert thresholds matching your SLOs (current: error rate >25%, latency p95 >30s)
- Use MetricsCollector for in-memory aggregation when Prometheus is not deployed
CISO/CTO Grafana Dashboard¶
The CISO/CTO dashboard provides executive-level visibility into the security posture and code quality of all registered projects. It is distributed as a pre-built Grafana JSON model at grafana/dashboard_ciso.json.
Import¶
- Open Grafana and navigate to Dashboards > Import.
- Upload grafana/dashboard_ciso.json or paste its contents.
- Select the Prometheus datasource configured for CodeGraph metrics.
- Click Import. The dashboard appears under the CodeGraph folder.
Datasource¶
The dashboard requires a Prometheus datasource pointed at the standard CodeGraph metrics endpoint:
/api/v1/metrics
Auto-refresh is set to 5 minutes by default. Adjust in Dashboard Settings > Time Options if needed.
Variables¶
Two template variables are available in the dashboard header for filtering:
| Variable | Description | Default |
|---|---|---|
| $group | Project group name. Filters all panels to show only projects in the selected group. | All |
| $project | Individual project name. When set, panels show data for that project only. | All |
Panels (12)¶
| # | Panel | Type | Description |
|---|---|---|---|
| 1 | Portfolio Health | Stat | Average codegraph_project_health_score across all projects (filtered by $group). Color thresholds: green >=70, yellow >=50, red <50. |
| 2 | Risk Distribution | Pie chart | Distribution of projects by risk level (critical, high, medium, low) based on health score ranges. |
| 3 | Red Zone Count | Stat | Number of projects with codegraph_project_health_score < 50. Shows red when count > 0. |
| 4 | SCA Vulnerabilities | Stat | Sum of codegraph_project_sca_vulnerabilities across all projects. Color thresholds: green 0, yellow >=1, red >=5. |
| 5 | Audit Score | Bar gauge | Per-project codegraph_project_audit_score displayed as horizontal bars, sorted ascending. |
| 6 | Compliance Score | Bar gauge | Per-project codegraph_project_compliance_score displayed as horizontal bars, sorted ascending. |
| 7 | Release Readiness | Pie chart | Distribution of codegraph_project_release_status values: ready (1), blocked (0), unknown (-1). |
| 8 | Red Zone by Category | Table / bar | Uses codegraph_red_zone_items_count grouped by project and category to highlight where the portfolio is failing. |
| 9 | Health Trend | Time series | codegraph_project_health_score over time for selected $project or averaged across $group. |
| 10 | Compliance Trend | Time series | codegraph_project_compliance_score over time for selected $project or averaged across $group. |
| 11 | Top 5 Worst Projects | Table | Five projects with the lowest codegraph_project_health_score. Columns: project name, health score, audit score, compliance score, risk level. |
| 12 | Top 5 Best Projects | Table | Five projects with the highest codegraph_project_health_score. Same columns as Top 5 Worst. |
Theme compatibility¶
All panels use threshold-based colors that render correctly in both Grafana dark and light themes. No custom CSS overrides are required.
Dashboard Alert Rules¶
Five Prometheus alert rules are defined in monitoring/rules/dashboard_alerts.yml to notify when projects exceed risk thresholds. These rules complement the existing CodeGraph availability and latency alerts.
| Alert Name | Condition (PromQL) | Duration (for) | Severity | Recommended Action |
|---|---|---|---|---|
| DashboardProjectRiskCritical | codegraph_project_health_score < 30 | 10m | critical | Immediately review the project. Investigate critical SCA vulnerabilities, failing release gate checks, and compliance gaps. Escalate to the responsible team lead. |
| DashboardProjectRiskHigh | codegraph_project_health_score < 50 | 30m | warning | Schedule a remediation review within 48 hours. Check audit findings by severity and address high-priority items first. |
| DashboardComplianceGap | codegraph_project_compliance_score < 40 | 1h | warning | Run a full GOST R 56939 compliance evaluation (codegraph_compliance_gost). Focus on non-compliant processes shown in the compliance heatmap. |
| DashboardReleaseGateFail | codegraph_project_release_status == 0 | 15m | warning | Review release gate failures with codegraph_release_gate_check. Address blocking checks before the next release window. |
| DashboardScaCriticalVuln | codegraph_project_sca_vulnerabilities{severity="critical"} > 0 | 5m | critical | Patch critical CVEs immediately. Run codegraph_sbom_audit to identify affected dependencies. Consider emergency release if production is exposed. |
Example rule configuration¶
groups:
  - name: codegraph_dashboard
    rules:
      - alert: DashboardProjectRiskCritical
        expr: codegraph_project_health_score < 30
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Project {{ $labels.project }} health score critically low ({{ $value }})"
          runbook: "Check audit, SCA, and compliance status. Escalate to team lead."
      - alert: DashboardProjectRiskHigh
        expr: codegraph_project_health_score < 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} in red zone (health {{ $value }})"
      - alert: DashboardComplianceGap
        expr: codegraph_project_compliance_score < 40
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} compliance score below 40% ({{ $value }})"
      - alert: DashboardReleaseGateFail
        expr: codegraph_project_release_status == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} release gate blocked"
      - alert: DashboardScaCriticalVuln
        expr: codegraph_project_sca_vulnerabilities{severity="critical"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Project {{ $labels.project }} has {{ $value }} critical SCA vulnerabilities"
Dashboard V2 Metrics¶
Eight dashboard gauges are exposed through the standard Prometheus scrape endpoint. All metrics are of type Gauge and are updated when dashboard aggregation code runs.
codegraph_project_health_score¶
Composite health score for a project, aggregated from audit, compliance, release gate, and SCA sub-scores.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, group, language |
| Unit | percent |
| Range | 0.0 – 100.0 |
# Projects with health below 50 (red zone)
codegraph_project_health_score < 50
# Average health across all projects in a group
avg(codegraph_project_health_score{group="backend"})
codegraph_project_audit_score¶
Latest audit score for a project, based on the 12-dimension audit evaluation.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | score |
| Range | 0.0 – 10.0 |
# Bottom 5 projects by audit score
bottomk(5, codegraph_project_audit_score)
codegraph_project_compliance_score¶
GOST R 56939 compliance score for a project, expressed as the percentage of passed processes.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | percent |
| Range | 0.0 – 100.0 |
# Projects below 40% compliance
codegraph_project_compliance_score < 40
codegraph_project_release_status¶
Release gate status for a project. Encoded as a numeric value: 1.0 = pass, 0.5 = warn, 0.0 = fail.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | enum (1.0/0.5/0.0) |
| Range | 0.0 – 1.0 |
# Count of failed projects
count(codegraph_project_release_status == 0)
codegraph_project_sca_health¶
SCA (Software Composition Analysis) health score for a project, based on dependency vulnerability assessment.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | percent |
| Range | 0.0 – 100.0 |
# Projects with SCA health below 70
codegraph_project_sca_health < 70
codegraph_project_sca_vulnerabilities¶
Count of SCA vulnerabilities for a project, broken down by severity.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, severity |
| Unit | count |
| Range | 0 – unbounded |
# Total critical vulnerabilities across all projects
sum(codegraph_project_sca_vulnerabilities{severity="critical"})
# Per-project vulnerability count
sum by (project) (codegraph_project_sca_vulnerabilities)
codegraph_portfolio_avg_health¶
Average health score across all registered projects, optionally filtered by group. This is a pre-aggregated convenience metric.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | group |
| Unit | percent |
| Range | 0.0 – 100.0 |
# Portfolio health for all groups
codegraph_portfolio_avg_health{group=""}
# Health by group
codegraph_portfolio_avg_health{group="backend"}
codegraph_red_zone_items_count¶
Number of red-zone items grouped by project and category.
| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, category |
| Unit | count |
| Range | 0 – unbounded |
# Alert when any release-related red-zone items exist
sum(codegraph_red_zone_items_count{category="release"}) > 0
Metrics endpoint¶
Dashboard gauges are served through the standard Prometheus endpoint:
GET /api/v1/metrics
The response uses Prometheus text exposition format:
# HELP codegraph_project_health_score Composite health score for a project
# TYPE codegraph_project_health_score gauge
codegraph_project_health_score{project="payments-api",group="backend"} 85.0
codegraph_project_health_score{project="legacy-auth",group="backend"} 35.0
# HELP codegraph_portfolio_avg_health Average health score across all projects
# TYPE codegraph_portfolio_avg_health gauge
codegraph_portfolio_avg_health{group=""} 72.5
codegraph_portfolio_avg_health{group="backend"} 60.0
# HELP codegraph_red_zone_items_count Number of red zone items by project and category
# TYPE codegraph_red_zone_items_count gauge
codegraph_red_zone_items_count{project="legacy-auth",category="release"} 2
Add this endpoint to your Prometheus scrape configuration:
scrape_configs:
  - job_name: codegraph-dashboard
    scrape_interval: 5m
    metrics_path: /api/v1/metrics
    static_configs:
      - targets: ["localhost:8000"]
See Also¶
- REST API Reference — API endpoints including health and metrics
- MCP Tools Reference — Dashboard V2 MCP tools
- Prometheus Documentation
- Grafana Documentation