Monitoring Guide

This guide covers monitoring, metrics, and health checks for CodeGraph in production environments.

Overview

The monitoring module (src/monitoring/) provides:

  • Prometheus metrics — 18 metrics for observability
  • Structured JSON logging — StructuredLogger with a timed_operation() context manager
  • Health check endpoints — a standalone FastAPI app and an API router
  • Monitoring decorators — monitor_scenario, monitor_agent, monitor_cache
  • Recording functions — record_llm_call, record_cpg_query, record_retrieval
  • MetricsCollector — singleton for in-memory metric aggregation
  • MetricsMiddleware — automatic HTTP request instrumentation


Quick Start

Enable Metrics

from src.monitoring import (
    MetricsCollector,
    StructuredLogger,
    monitor_scenario,
    monitor_agent,
    monitor_cache,
    record_llm_call,
)

# Create structured logger
logger = StructuredLogger("my_component")

# Get metrics collector (singleton)
metrics = MetricsCollector()

# Use decorators for automatic tracking
@monitor_scenario("my_scenario")
def process_query(query: str):
    # Automatically records SCENARIO_DURATION, SCENARIO_SUCCESS/FAILURE,
    # ACTIVE_REQUESTS, TOTAL_REQUESTS
    return result

Health Check Application

from src.monitoring.health import create_health_app

# Create standalone health check FastAPI app
app = create_health_app()

# Endpoints available:
# GET /health       - Overall health status
# GET /health/live  - Liveness probe
# GET /health/ready - Readiness probe
# GET /metrics      - Prometheus metrics
# GET /stats        - System statistics

Prometheus Metrics

Available Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| rag_scenario_duration_seconds | Histogram | scenario_name | Scenario execution time |
| rag_scenario_success_total | Counter | scenario_name | Successful executions |
| rag_scenario_failure_total | Counter | scenario_name, error_type | Failed executions |
| rag_agent_duration_seconds | Histogram | agent_name, scenario | Agent execution time |
| rag_agent_success_total | Counter | agent_name, scenario | Agent successes |
| rag_agent_failure_total | Counter | agent_name, scenario, error_type | Agent failures |
| rag_cache_hits_total | Counter | cache_type | Cache hits |
| rag_cache_misses_total | Counter | cache_type | Cache misses |
| rag_cache_size | Gauge | cache_type | Current cache size |
| rag_active_requests | Gauge | — | Active requests being processed |
| rag_total_requests | Counter | — | Total requests processed |
| rag_llm_latency_seconds | Histogram | model, operation | LLM API call latency |
| rag_llm_tokens_total | Counter | model, token_type | LLM tokens used |
| rag_llm_errors_total | Counter | model, error_type | LLM API errors |
| rag_cpg_query_latency_seconds | Histogram | query_type | CPG query latency |
| rag_cpg_query_results | Histogram | query_type | CPG query result count |
| rag_retrieval_latency_seconds | Histogram | retrieval_type | Retrieval operation latency |
| rag_retrieval_results_count | Histogram | retrieval_type | Retrieval result count |

Total: 18 metrics (all defined in src/monitoring/metrics.py). ACTIVE_REQUESTS and TOTAL_REQUESTS have no labels.

Histogram Buckets

# SCENARIO_DURATION buckets (seconds)
(0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0)

# AGENT_DURATION buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)

# LLM_LATENCY buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0)

# CPG_QUERY_LATENCY buckets (seconds)
(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0)

# CPG_QUERY_RESULTS buckets (count)
(0, 1, 5, 10, 50, 100, 500, 1000)

# RETRIEVAL_LATENCY buckets (seconds)
(0.1, 0.5, 1.0, 2.0, 5.0)

# RETRIEVAL_RESULTS buckets (count)
(0, 1, 5, 10, 20, 50)
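
Prometheus histogram buckets are cumulative: each `le` bucket counts observations less than or equal to its upper bound, and an implicit +Inf bucket counts everything. The sketch below (illustrative only, not CodeGraph code) shows how observations land in the CPG_QUERY_LATENCY buckets listed above:

```python
# Illustrative sketch: cumulative bucket counting as Prometheus does it.
# Each bucket with upper bound `le` counts observations where value <= le,
# and the implicit +Inf bucket counts every observation.
def cumulative_bucket_counts(bounds, observations):
    counts = {bound: 0 for bound in bounds}
    counts[float("inf")] = 0
    for value in observations:
        for bound in bounds:
            if value <= bound:
                counts[bound] += 1
        counts[float("inf")] += 1
    return counts

# CPG_QUERY_LATENCY buckets from above, with four sample query latencies
bounds = (0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0)
counts = cumulative_bucket_counts(bounds, [0.03, 0.04, 0.3, 4.2])
# counts[0.05] == 2, counts[0.5] == 3, counts[5.0] == 4
```

This cumulative shape is what `histogram_quantile()` consumes in the Grafana queries later in this guide.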

Recording Metrics

from src.monitoring.metrics import (
    SCENARIO_DURATION,
    SCENARIO_SUCCESS,
    SCENARIO_FAILURE,
    LLM_LATENCY,
    LLM_TOKENS,
    LLM_ERRORS,
    CACHE_HITS,
    CACHE_MISSES,
)

# Record scenario execution
with SCENARIO_DURATION.labels(scenario_name="security_audit").time():
    result = run_scenario()

SCENARIO_SUCCESS.labels(scenario_name="security_audit").inc()

# Record LLM usage
LLM_LATENCY.labels(model="GigaChat-2-Pro", operation="generate").observe(2.5)
LLM_TOKENS.labels(model="GigaChat-2-Pro", token_type="output").inc(150)
LLM_ERRORS.labels(model="GigaChat-2-Pro", error_type="timeout").inc()

# Record cache access
CACHE_HITS.labels(cache_type="query_plan").inc()
CACHE_MISSES.labels(cache_type="embedding").inc()

Recording Functions

Helper functions that update several related metrics in one call:

from src.monitoring import record_llm_call, record_cpg_query, record_retrieval

# Record LLM call (updates LLM_LATENCY, LLM_TOKENS, LLM_ERRORS)
record_llm_call(
    model="GigaChat-2-Pro",
    operation="generate",
    duration=2.5,
    input_tokens=100,
    output_tokens=150,
    error=None,  # or error type string on failure
)

# Record CPG query (updates CPG_QUERY_LATENCY, CPG_QUERY_RESULTS)
record_cpg_query(
    query_type="method_lookup",
    duration=0.05,
    result_count=12,
)

# Record retrieval operation (updates RETRIEVAL_LATENCY, RETRIEVAL_RESULTS)
record_retrieval(
    retrieval_type="hybrid",
    duration=1.2,
    result_count=10,
)

Decorators

@monitor_scenario

Automatically tracks scenario execution: duration, success/failure, active requests.

from src.monitoring import monitor_scenario

@monitor_scenario("security_audit")
def run_security_audit(codebase: str):
    # Records:
    # - SCENARIO_DURATION.labels(scenario_name="security_audit")
    # - SCENARIO_SUCCESS / SCENARIO_FAILURE
    # - ACTIVE_REQUESTS inc/dec
    # - TOTAL_REQUESTS inc
    return findings

Parameters: scenario_name: str — name of the scenario.

@monitor_agent

Tracks agent execution with scenario context.

from src.monitoring import monitor_agent

@monitor_agent("retriever_agent", scenario="security")
def retrieve_context(query: str) -> list:
    # Records:
    # - AGENT_DURATION.labels(agent_name="retriever_agent", scenario="security")
    # - AGENT_SUCCESS / AGENT_FAILURE
    return results

Parameters: agent_name: str, scenario: str = "unknown".

@monitor_cache

Tracks cache hit/miss based on return value (None = miss, otherwise = hit).

from src.monitoring import monitor_cache

@monitor_cache("query_plan")
def get_cached_query(key: str) -> dict | None:
    # Records:
    # - CACHE_HITS.labels(cache_type="query_plan") on non-None return
    # - CACHE_MISSES.labels(cache_type="query_plan") on None return
    return cached_result

Parameters: cache_type: str = "query_plan".
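
A minimal sketch of how a decorator like this could work, with a plain dict standing in for the Prometheus counters so it runs anywhere (the real decorator lives in src/monitoring and records CACHE_HITS / CACHE_MISSES):

```python
import functools

# Hypothetical stand-in for the Prometheus counters, so the sketch is
# self-contained. The real decorator increments CACHE_HITS / CACHE_MISSES.
cache_stats = {"hits": 0, "misses": 0}

def monitor_cache(cache_type: str = "query_plan"):
    """Sketch of a hit/miss decorator: a None return counts as a miss,
    any other return value counts as a hit."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if result is None:
                cache_stats["misses"] += 1  # CACHE_MISSES.labels(cache_type).inc()
            else:
                cache_stats["hits"] += 1    # CACHE_HITS.labels(cache_type).inc()
            return result
        return wrapper
    return decorator

@monitor_cache("query_plan")
def get_cached(key, store):
    return store.get(key)

store = {"a": 1}
get_cached("a", store)  # hit
get_cached("b", store)  # miss
```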


MetricsCollector

Singleton for in-memory metric aggregation, useful when a Prometheus server is not available.

from src.monitoring import MetricsCollector, get_metrics_collector

# Both return the same singleton instance
collector = MetricsCollector()
collector = get_metrics_collector()

# Record metrics
collector.record_latency("query_generation", 0.245)
collector.record_scenario("security_audit", success=True, latency=1.5)
collector.record_cache_access("query_plan", hit=True)
collector.increment_counter("total_queries", amount=1)
collector.set_gauge("active_requests", 5.0)

# Get summary
summary = collector.get_summary()  # Returns MetricsSummary
stats = collector.get_scenario_stats()  # Per-scenario dict
cache = collector.get_cache_stats()  # Per-cache-type dict
uptime = collector.get_uptime_seconds()

# Reset all collected metrics
collector.reset()

MetricsSummary

Dataclass returned by MetricsCollector.get_summary():

| Field | Type | Description |
|---|---|---|
| total_requests | int | Total scenario executions |
| successful_requests | int | Successful executions |
| failed_requests | int | Failed executions |
| success_rate | float | Success ratio (0.0–1.0) |
| avg_latency_ms | float | Average latency in ms |
| p50_latency_ms | float | 50th percentile latency |
| p95_latency_ms | float | 95th percentile latency |
| p99_latency_ms | float | 99th percentile latency |
| cache_hit_rate | float | Cache hit ratio (0.0–1.0) |
| active_requests | int | Currently active requests |

summary = collector.get_summary()
data = summary.to_dict()  # Convert to dict for JSON serialization

Health Checks

Health Status

from src.monitoring.health import HealthStatus

class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

Component Health

The system checks these components by default:

HealthChecker (standalone) — checks registered via _register_default_checks():

| Component | Check Method | Description |
|---|---|---|
| database | _check_database | DuckDB CPG database connectivity |
| llm | _check_llm | LLM provider response |
| llm_providers | _check_llm_providers | Available LLM provider modules |
| quality | _check_quality | RAG response quality feedback loop |

API Router (src/api/routers/health.py):

| Component | Check Function | Description |
|---|---|---|
| database | DatabaseHealthCheck.check() | PostgreSQL database |
| llm | check_llm_health() | LLM provider config |
| chromadb | check_chromadb_health() | ChromaDB vector store |
| cpg | check_cpg_health() | DuckDB CPG database |

Custom Health Checks

from src.monitoring.health import HealthChecker, ComponentHealth, HealthStatus

checker = HealthChecker()

def check_custom_service():
    try:
        response_time = ping_service()
        if response_time < 100:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.HEALTHY,
                latency_ms=response_time
            )
        else:
            return ComponentHealth(
                name="custom_service",
                status=HealthStatus.DEGRADED,
                latency_ms=response_time,
                message="High latency"
            )
    except Exception as e:
        return ComponentHealth(
            name="custom_service",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

checker.register_check("custom_service", check_custom_service)

Health Endpoints — Standalone App

Created by create_health_app() from src/monitoring/health:

| Endpoint | Description |
|---|---|
| GET /health | Overall health status |
| GET /health/live | Liveness probe |
| GET /health/ready | Readiness probe |
| GET /metrics | Prometheus metrics |
| GET /stats | System statistics |

GET /health response:

{
  "status": "healthy",
  "timestamp": 1702580000.0,
  "uptime_seconds": 3600.0,
  "version": "2.0.0",
  "components": [
    {
      "name": "database",
      "status": "healthy",
      "latency_ms": 2.5,
      "message": "Connected, 1500 methods",
      "details": {}
    },
    {
      "name": "llm",
      "status": "healthy",
      "latency_ms": 450.0,
      "message": "LLM responding normally",
      "details": {"provider": "yandex"}
    }
  ]
}

GET /health/live response:

{"status": "alive"}

GET /health/ready response:

{"status": "ready"}

GET /stats response:

{
  "summary": {"total_requests": 100, "success_rate": 0.95, "...": "..."},
  "scenarios": {"security_audit": {"total_requests": 50, "success_rate": 0.98}},
  "cache": {"query_plan": {"hits": 80, "misses": 20, "hit_rate": 0.8}},
  "uptime_seconds": 3600.0
}

Health Endpoints — API Router

Mounted at /api/v1/health in the main FastAPI application:

| Endpoint | Description |
|---|---|
| GET /api/v1/health | Full health check (all components as dict) |
| GET /api/v1/health/live | Liveness probe |
| GET /api/v1/health/ready | Readiness probe (checks DB) |
| GET /api/v1/health/version | API version |
| GET /api/v1/metrics | Prometheus metrics (separate router) |

GET /api/v1/health response:

{
  "status": "healthy",
  "version": "2.0.0",
  "uptime_seconds": 3600.0,
  "timestamp": "2026-03-07T10:30:00Z",
  "components": {
    "database": {"status": "healthy"},
    "llm": {"status": "healthy", "provider": "yandex"},
    "chromadb": {"status": "healthy", "collections": 5},
    "cpg": {"status": "healthy", "db_path": "/data/project.duckdb", "methods": 1500}
  }
}

Note: In the API router, components is a dict (keyed by component name), not an array.

Standalone Functions

Quick-check functions for use outside the health check framework:

from src.monitoring.health import (
    check_database_connection,
    check_llm_availability,
    check_vector_store,
    run_health_check_cli,
)

# Quick boolean checks
db_ok = check_database_connection()    # True if DuckDB responds to SELECT 1
llm_ok = check_llm_availability()      # True if LLM is_available()
vs_ok = check_vector_store()           # True if ChromaDB heartbeat succeeds

# CLI health check (prints report, exits with code 0/1/2)
run_health_check_cli()

Structured Logging

StructuredLogger

JSON-formatted logger for production environments. Outputs structured data suitable for Elasticsearch, Splunk, or CloudWatch.

from src.monitoring import StructuredLogger

logger = StructuredLogger("my_component")

# Log with structured data
logger.info("Processing request", request_id="abc123", user="test")
logger.warning("Slow query", duration_ms=500)
logger.error("Failed to connect", service="database", error="timeout")

Methods: debug(), info(), warning(), error(), critical() — all accept message: str and arbitrary **kwargs.

Log Format

{
  "timestamp": "2026-03-07T10:30:00.000Z",
  "level": "INFO",
  "logger": "my_component",
  "message": "Processing request",
  "request_id": "abc123",
  "user": "test"
}
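
A log line in this shape can be produced with the standard library alone. The following is a sketch, not the actual StructuredLogger implementation; the `fields` attribute name is an assumption used here to carry the structured kwargs through logging's `extra` mechanism:

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object matching the format above (sketch)."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured kwargs arrive on the record via logging's `extra` dict;
        # "fields" is a hypothetical attribute name chosen for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("my_component")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Processing request", extra={"fields": {"request_id": "abc123"}})
line = json.loads(stream.getvalue())
```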

Timed Operations

Context manager for timing operations with automatic success/error logging:

from src.monitoring import StructuredLogger

logger = StructuredLogger("query_engine")

with logger.timed_operation("query_generation", scenario="security"):
    result = generate_query()
    # On success, logs:
    # {"message": "Operation completed: query_generation",
    #  "operation": "query_generation", "duration_ms": 245.0,
    #  "status": "success", "scenario": "security"}

    # On exception, logs error with status="error", error_type, error_message
    # then re-raises the exception
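
The behavior described above can be sketched with a plain contextmanager. This is illustrative, not the repo's implementation; `log_fn` is a hypothetical callback standing in for the structured logger:

```python
import time
from contextlib import contextmanager

# Sketch of a timed-operation context manager: measures wall-clock duration,
# logs success or error with the timing, and re-raises on failure.
@contextmanager
def timed_operation(log_fn, operation: str, **fields):
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        duration_ms = (time.perf_counter() - start) * 1000
        log_fn({"operation": operation, "duration_ms": duration_ms,
                "status": "error", "error_type": type(exc).__name__, **fields})
        raise  # re-raise so callers still see the failure
    else:
        duration_ms = (time.perf_counter() - start) * 1000
        log_fn({"operation": operation, "duration_ms": duration_ms,
                "status": "success", **fields})

events = []
with timed_operation(events.append, "query_generation", scenario="security"):
    pass  # the timed work goes here
```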

MetricsMiddleware

HTTP middleware that automatically records Prometheus metrics for every request. Defined in src/api/middleware/metrics.py.

from src.api.middleware.metrics import MetricsMiddleware

app.add_middleware(MetricsMiddleware)

Metrics recorded per request:

  • ACTIVE_REQUESTS — incremented on entry, decremented on exit (gauge)
  • TOTAL_REQUESTS — incremented on entry (counter)
  • SCENARIO_DURATION — records request duration with a scenario_name=f"http:{path}" label
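
The inc/dec-around-the-request pattern can be sketched as a framework-agnostic ASGI middleware. This is illustrative only (the real MetricsMiddleware records the Prometheus metrics above); a plain dict stands in for the gauges and counters, and `dummy_app` is a hypothetical stand-in application:

```python
import asyncio
import time

# Stand-in for the Prometheus metrics, so the sketch runs without dependencies.
metrics = {"active": 0, "total": 0, "durations": []}

class CountingMiddleware:
    """Sketch: count active/total requests and time each HTTP request."""
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        metrics["active"] += 1      # ACTIVE_REQUESTS.inc()
        metrics["total"] += 1       # TOTAL_REQUESTS.inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            metrics["active"] -= 1  # ACTIVE_REQUESTS.dec()
            metrics["durations"].append(time.perf_counter() - start)

async def dummy_app(scope, receive, send):
    pass  # hypothetical stand-in for the real ASGI application

app = CountingMiddleware(dummy_app)
asyncio.run(app({"type": "http", "path": "/query"}, None, None))
```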


Grafana Dashboards

Request Rate:

rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m])

Error Rate:

rate(rag_scenario_failure_total[5m]) / (rate(rag_scenario_success_total[5m]) + rate(rag_scenario_failure_total[5m]))

P95 Latency:

histogram_quantile(0.95, rate(rag_scenario_duration_seconds_bucket[5m]))

LLM Token Usage:

sum(rate(rag_llm_tokens_total[1h])) by (model)

Cache Hit Rate:

rate(rag_cache_hits_total[5m]) / (rate(rag_cache_hits_total[5m]) + rate(rag_cache_misses_total[5m]))

LLM Error Rate:

sum(rate(rag_llm_errors_total[5m])) by (model, error_type)

Active Requests:

rag_active_requests

Alert Rules

The alert rules shipped in monitoring/rules/alerts.yml:

groups:
  - name: codegraph_availability
    rules:
      - alert: CodeGraphAPIDown
        expr: up{job="codegraph-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CodeGraph API is down"

      - alert: CodeGraphHighErrorRate
        expr: |
          (
            sum(rate(rag_scenario_failure_total[5m]))
            / (sum(rate(rag_scenario_success_total[5m])) + sum(rate(rag_scenario_failure_total[5m])))
          ) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High scenario error rate (>25%)"

  - name: codegraph_latency
    rules:
      - alert: CodeGraphHighLatency
        expr: histogram_quantile(0.95, sum(rate(rag_scenario_duration_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High scenario latency (p95 > 30s)"

      - alert: CodeGraphLLMSlowResponses
        expr: histogram_quantile(0.95, sum(rate(rag_llm_latency_seconds_bucket[5m])) by (le)) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM response latency high (p95 > 10s)"

  - name: codegraph_resources
    rules:
      - alert: CodeGraphHighActiveRequests
        expr: rag_active_requests > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High number of active requests (>50)"

      - alert: CodeGraphLLMErrors
        expr: sum(rate(rag_llm_errors_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM errors detected"

  - name: codegraph_infrastructure
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target {{ $labels.job }} is down"

Alertmanager

Alertmanager configuration in monitoring/alertmanager.yml:

  • Routing: groups alerts by alertname and severity. Critical alerts repeat every 1 hour, others every 4 hours.
  • Receiver: default webhook via ${ALERTMANAGER_WEBHOOK_URL} with send_resolved: true. Slack and email receivers are included but commented out by default.
  • Inhibit rules: critical alerts suppress warning alerts with the same alertname.
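
The bullets above could map to a configuration roughly like the following. This is a sketch only; monitoring/alertmanager.yml is authoritative, and note that Alertmanager does not expand environment variables natively, so ${ALERTMANAGER_WEBHOOK_URL} implies a substitution step at deploy time:

```yaml
route:
  group_by: [alertname, severity]
  receiver: default-webhook
  routes:
    - match:
        severity: critical
      repeat_interval: 1h
    - match:
        severity: warning
      repeat_interval: 4h

receivers:
  - name: default-webhook
    webhook_configs:
      - url: ${ALERTMANAGER_WEBHOOK_URL}
        send_resolved: true

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname"]
```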

Yandex Cloud Dashboard

monitoring/yandex/dashboard.json provides a pre-built dashboard for Yandex Monitoring with CodeGraph metrics.


Kubernetes Integration

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: codegraph
spec:
  template:
    spec:
      containers:
        - name: codegraph
          ports:
            - containerPort: 8000
              name: api
          livenessProbe:
            httpGet:
              path: /api/v1/health/live
              port: api
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/v1/health/ready
              port: api
            initialDelaySeconds: 10
            periodSeconds: 5

ServiceMonitor for Prometheus

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: codegraph
spec:
  selector:
    matchLabels:
      app: codegraph
  endpoints:
    - port: api
      path: /api/v1/metrics
      interval: 30s

Best Practices

  1. Use decorators (monitor_scenario, monitor_agent, monitor_cache) for automatic instrumentation
  2. Use StructuredLogger.timed_operation() to measure and log operation duration
  3. Use recording functions (record_llm_call, record_cpg_query, record_retrieval) instead of directly incrementing counters
  4. Monitor token usage with LLM_TOKENS to control costs
  5. Check P95/P99 latencies not just averages
  6. Set alert thresholds matching your SLOs (current: error rate >25%, latency p95 >30s)
  7. Use MetricsCollector for in-memory aggregation when Prometheus is not deployed

CISO/CTO Grafana Dashboard

The CISO/CTO dashboard provides executive-level visibility into the security posture and code quality of all registered projects. It is distributed as a pre-built Grafana JSON model at grafana/dashboard_ciso.json.

Import

  1. Open Grafana and navigate to Dashboards > Import.
  2. Upload grafana/dashboard_ciso.json or paste its contents.
  3. Select the Prometheus datasource configured for CodeGraph metrics.
  4. Click Import. The dashboard appears under the CodeGraph folder.

Datasource

The dashboard requires a Prometheus datasource pointed at the standard CodeGraph metrics endpoint:

/api/v1/metrics

Auto-refresh is set to 5 minutes by default. Adjust in Dashboard Settings > Time Options if needed.

Variables

Two template variables are available in the dashboard header for filtering:

| Variable | Description | Default |
|---|---|---|
| $group | Project group name. Filters all panels to show only projects in the selected group. | All |
| $project | Individual project name. When set, panels show data for that project only. | All |
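
Panels typically interpolate these variables into their PromQL. For example (assuming the panels filter with regex matchers on the project and group labels):

```promql
# Average health for the current selection
avg(codegraph_project_health_score{group=~"$group", project=~"$project"})

# Red-zone projects within the selected group
count(codegraph_project_health_score{group=~"$group"} < 50)
```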

Panels (12)

| # | Panel | Type | Description |
|---|---|---|---|
| 1 | Portfolio Health | Stat | Average codegraph_project_health_score across all projects (filtered by $group). Color thresholds: green >=70, yellow >=50, red <50. |
| 2 | Risk Distribution | Pie chart | Distribution of projects by risk level (critical, high, medium, low) based on health score ranges. |
| 3 | Red Zone Count | Stat | Number of projects with codegraph_project_health_score < 50. Shows red when count > 0. |
| 4 | SCA Vulnerabilities | Stat | Sum of codegraph_project_sca_vulnerabilities across all projects. Color thresholds: green 0, yellow >=1, red >=5. |
| 5 | Audit Score | Bar gauge | Per-project codegraph_project_audit_score displayed as horizontal bars, sorted ascending. |
| 6 | Compliance Score | Bar gauge | Per-project codegraph_project_compliance_score displayed as horizontal bars, sorted ascending. |
| 7 | Release Readiness | Pie chart | Distribution of codegraph_project_release_status values: ready (1), blocked (0), unknown (-1). |
| 8 | Red Zone by Category | Table / bar | Uses codegraph_red_zone_items_count grouped by project and category to highlight where the portfolio is failing. |
| 9 | Health Trend | Time series | codegraph_project_health_score over time for the selected $project, or averaged across $group. |
| 10 | Compliance Trend | Time series | codegraph_project_compliance_score over time for the selected $project, or averaged across $group. |
| 11 | Top 5 Worst Projects | Table | Five projects with the lowest codegraph_project_health_score. Columns: project name, health score, audit score, compliance score, risk level. |
| 12 | Top 5 Best Projects | Table | Five projects with the highest codegraph_project_health_score. Same columns as Top 5 Worst. |

Theme compatibility

All panels use threshold-based colors that render correctly in both Grafana dark and light themes. No custom CSS overrides are required.


Dashboard Alert Rules

Five Prometheus alert rules are defined in monitoring/rules/dashboard_alerts.yml to notify when projects exceed risk thresholds. These rules complement the existing CodeGraph availability and latency alerts.

| Alert Name | Condition (PromQL) | Duration (for) | Severity | Recommended Action |
|---|---|---|---|---|
| DashboardProjectRiskCritical | codegraph_project_health_score < 30 | 10m | critical | Immediately review the project. Investigate critical SCA vulnerabilities, failing release gate checks, and compliance gaps. Escalate to the responsible team lead. |
| DashboardProjectRiskHigh | codegraph_project_health_score < 50 | 30m | warning | Schedule a remediation review within 48 hours. Check audit findings by severity and address high-priority items first. |
| DashboardComplianceGap | codegraph_project_compliance_score < 40 | 1h | warning | Run a full GOST R 56939 compliance evaluation (codegraph_compliance_gost). Focus on non-compliant processes shown in the compliance heatmap. |
| DashboardReleaseGateFail | codegraph_project_release_status == 0 | 15m | warning | Review release gate failures with codegraph_release_gate_check. Address blocking checks before the next release window. |
| DashboardScaCriticalVuln | codegraph_project_sca_vulnerabilities{severity="critical"} > 0 | 5m | critical | Patch critical CVEs immediately. Run codegraph_sbom_audit to identify affected dependencies. Consider emergency release if production is exposed. |

Example rule configuration

groups:
  - name: codegraph_dashboard
    rules:
      - alert: DashboardProjectRiskCritical
        expr: codegraph_project_health_score < 30
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Project {{ $labels.project }} health score critically low ({{ $value }})"
          runbook: "Check audit, SCA, and compliance status. Escalate to team lead."

      - alert: DashboardProjectRiskHigh
        expr: codegraph_project_health_score < 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} in red zone (health {{ $value }})"

      - alert: DashboardComplianceGap
        expr: codegraph_project_compliance_score < 40
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} compliance score below 40% ({{ $value }})"

      - alert: DashboardReleaseGateFail
        expr: codegraph_project_release_status == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project }} release gate blocked"

      - alert: DashboardScaCriticalVuln
        expr: codegraph_project_sca_vulnerabilities{severity="critical"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Project {{ $labels.project }} has {{ $value }} critical SCA vulnerabilities"

Dashboard V2 Metrics

Eight dashboard gauges are exposed through the standard Prometheus scrape endpoint. All metrics are of type Gauge and are updated when dashboard aggregation code runs.

codegraph_project_health_score

Composite health score for a project, aggregated from audit, compliance, release gate, and SCA sub-scores.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, group, language |
| Unit | percent |
| Range | 0.0 – 100.0 |

# Projects with health below 50 (red zone)
codegraph_project_health_score < 50

# Average health across all projects in a group
avg(codegraph_project_health_score{group="backend"})

codegraph_project_audit_score

Latest audit score for a project, based on the 12-dimension audit evaluation.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | score |
| Range | 0.0 – 10.0 |

# Bottom 5 projects by audit score
bottomk(5, codegraph_project_audit_score)

codegraph_project_compliance_score

GOST R 56939 compliance score for a project, expressed as the percentage of passed processes.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | percent |
| Range | 0.0 – 100.0 |

# Projects below 40% compliance
codegraph_project_compliance_score < 40

codegraph_project_release_status

Release gate status for a project. Encoded as a numeric value: 1.0 = pass, 0.5 = warn, 0.0 = fail.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | enum (1.0/0.5/0.0) |
| Range | 0.0 – 1.0 |

# Count of failed projects
count(codegraph_project_release_status == 0)

codegraph_project_sca_health

SCA (Software Composition Analysis) health score for a project, based on dependency vulnerability assessment.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project |
| Unit | percent |
| Range | 0.0 – 100.0 |

# Projects with SCA health below 70
codegraph_project_sca_health < 70

codegraph_project_sca_vulnerabilities

Count of SCA vulnerabilities for a project, broken down by severity.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, severity |
| Unit | count |
| Range | 0 – unbounded |

# Total critical vulnerabilities across all projects
sum(codegraph_project_sca_vulnerabilities{severity="critical"})

# Per-project vulnerability count
sum by (project) (codegraph_project_sca_vulnerabilities)

codegraph_portfolio_avg_health

Average health score across all registered projects, optionally filtered by group. This is a pre-aggregated convenience metric.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | group |
| Unit | percent |
| Range | 0.0 – 100.0 |

# Portfolio health for all groups
codegraph_portfolio_avg_health{group=""}

# Health by group
codegraph_portfolio_avg_health{group="backend"}

codegraph_red_zone_items_count

Number of red-zone items grouped by project and category.

| Property | Value |
|---|---|
| Type | Gauge |
| Labels | project, category |
| Unit | count |
| Range | 0 – unbounded |

# Alert when any release-related red-zone items exist
sum(codegraph_red_zone_items_count{category="release"}) > 0
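
The aggregation behind codegraph_portfolio_avg_health and the red-zone count can be sketched in plain Python. This is illustrative only (the real aggregation runs inside CodeGraph and writes Prometheus gauges); the "web-ui" project and its score are hypothetical sample data:

```python
# Hypothetical per-project health scores; "web-ui" is invented for this sketch.
projects = [
    {"name": "payments-api", "group": "backend", "health": 85.0},
    {"name": "legacy-auth", "group": "backend", "health": 35.0},
    {"name": "web-ui", "group": "frontend", "health": 97.5},
]

def portfolio_avg_health(projects, group=None):
    """Average health across all projects, or only those in one group."""
    scores = [p["health"] for p in projects if group is None or p["group"] == group]
    return sum(scores) / len(scores) if scores else 0.0

def red_zone_count(projects, threshold=50.0):
    """Number of projects below the red-zone threshold."""
    return sum(1 for p in projects if p["health"] < threshold)

overall = portfolio_avg_health(projects)             # 72.5
backend = portfolio_avg_health(projects, "backend")  # 60.0
red = red_zone_count(projects)                       # 1
```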

Metrics endpoint

Dashboard gauges are served through the standard Prometheus endpoint:

GET /api/v1/metrics

The response uses Prometheus text exposition format:

# HELP codegraph_project_health_score Composite health score for a project
# TYPE codegraph_project_health_score gauge
codegraph_project_health_score{project="payments-api",group="backend"} 85.0
codegraph_project_health_score{project="legacy-auth",group="backend"} 35.0

# HELP codegraph_portfolio_avg_health Average health score across all projects
# TYPE codegraph_portfolio_avg_health gauge
codegraph_portfolio_avg_health{group=""} 72.5
codegraph_portfolio_avg_health{group="backend"} 60.0

# HELP codegraph_red_zone_items_count Number of red zone items by project and category
# TYPE codegraph_red_zone_items_count gauge
codegraph_red_zone_items_count{project="legacy-auth",category="release"} 2

Add this endpoint to your Prometheus scrape configuration:

scrape_configs:
  - job_name: codegraph-dashboard
    scrape_interval: 5m
    metrics_path: /api/v1/metrics
    static_configs:
      - targets: ["localhost:8000"]

See Also