Benchmark Framework Guide

Version 3.0 — Comprehensive evaluation platform for hybrid retrieval and scenario-based testing.

Overview

The benchmark framework provides multiple evaluation tools:

  1. Hybrid Retrieval Benchmark — compares vector-only, graph-only, and hybrid retrieval modes
  2. Comprehensive Benchmark Runner — scenario-based evaluation with 20 ground truth scenarios
  3. Intent Classification Benchmark — evaluates intent classifier accuracy
  4. Symbolic Execution Benchmark — measures V2 symbolic execution impact on taint analysis
  5. LLM Disambiguation Benchmark — tests intent classifier with LLM support

Quick Start

1. Run Hybrid Retrieval Benchmark

python scripts/benchmark_hybrid_retrieval.py --output-dir benchmark_results

Output:

================================================================================
BENCHMARK SUMMARY
================================================================================
Metric               Vector       Graph        Hybrid       vs Vector    vs Graph
--------------------------------------------------------------------------------
Precision@10         0.2182       0.2000       0.3000       +37.5%       +50.0%
Recall@10            0.4327       0.3543       0.5528       +27.8%       +56.0%
F1@10                0.2864       0.2510       0.3825       +33.6%       +52.4%
MRR                  1.0000       0.6364       1.0000       +0.0%        +57.1%
NDCG@10              0.5304       0.4443       0.6590       +24.3%       +48.3%

2. Run Comprehensive Benchmark

# All scenarios, English questions
python -m tests.benchmark.run_benchmark --language en

# Quick mode (5 questions per scenario)
python -m tests.benchmark.run_benchmark -q --language en

# Specific scenarios
python -m tests.benchmark.run_benchmark --scenarios 01,02,03

3. Run With Real Data (Python API)

from scripts.benchmark_hybrid_retrieval import HybridRetrievalBenchmark
from src.retrieval.vector_store_real import VectorStoreReal
from src.services.cpg_query_service import CPGQueryService

# Initialize stores
vector_store = VectorStoreReal(persist_directory="chroma_db")
cpg_service = CPGQueryService()

# Create and run benchmark
benchmark = HybridRetrievalBenchmark(
    vector_store=vector_store,
    cpg_service=cpg_service,
    output_dir="benchmark_results"
)

import asyncio
report = asyncio.run(benchmark.run_benchmark())
benchmark.save_report(report)

Hybrid Retrieval Benchmark

The hybrid retrieval benchmark (scripts/benchmark_hybrid_retrieval.py) compares three retrieval modes:

- Vector-only: Pure semantic search using ChromaDB
- Graph-only: Pure structural search using DuckDB/CPG
- Hybrid: RRF-merged combination of both

Key classes:

- HybridRetrievalBenchmark(vector_store, cpg_service, output_dir="benchmark_results") — main orchestrator
- BenchmarkQuery — dataclass for a single query with ground truth
- RetrievalMetrics — dataclass for per-query metrics
- BenchmarkReport — aggregate report
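The RRF merging that the hybrid mode relies on can be sketched in a few lines. This is a minimal, self-contained illustration of Reciprocal Rank Fusion, not the framework's actual implementation; the function and variable names are hypothetical:

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: fuse several ranked lists of result IDs.

    rankings: list of ranked ID lists, best result first.
    k: smoothing constant (60 is the value from the original RRF formulation).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["A", "B", "C"]  # hypothetical semantic-search result IDs
graph_hits = ["A", "C", "D"]   # hypothetical CPG-traversal result IDs
merged = rrf_merge([vector_hits, graph_hits])
# "A" ranks first: it appears at rank 1 in both lists
```

Results that appear high in both rankings (like "A") dominate, while results found by only one source are retained lower in the fused list — which is why hybrid can outperform both individual modes.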

CLI arguments:

python scripts/benchmark_hybrid_retrieval.py [OPTIONS]
Argument       Default                          Description
--db-path      active project                   Path to DuckDB CPG database
--chroma-path  chroma_db                        Path to ChromaDB vector store
--output-dir   benchmark_results                Output directory for results
--modes        vector_only graph_only hybrid    Retrieval modes to benchmark

Comprehensive Benchmark Runner

The main benchmark tool (tests/benchmark/run_benchmark.py) runs scenario-based evaluations against the full CodeGraph pipeline.

python -m tests.benchmark.run_benchmark [OPTIONS]

CLI arguments:

Argument         Short  Default    Description
--project               None       Project name (auto-switches active project)
--scenarios      -s     all        Comma-separated scenario IDs (e.g., 01,02,03)
--language       -l     all        Filter by language (en/ru)
--difficulty     -d     all        Filter: easy/medium/hard/expert
--max-questions  -n     unlimited  Maximum questions per scenario
--quick          -q                Quick mode (5 questions per scenario)
--mock           -m                Use mock copilot for infrastructure testing
--trace          -t     true       Enable traceability logging
--no-trace                         Disable traceability logging
--ragas          -r                Run RAGAS evaluation using LLM
--provider       -p     config     LLM provider (gigachat/yandex/openai/local)
--failed-from                      Re-run failed questions from previous run ID
--question-ids                     Specific question IDs (e.g., VULN_EN_002,VULN_EN_004)
--offset                0          Skip first N questions per scenario
--randomize                        Randomly select questions instead of sequential

Examples:

# Re-run failed questions from previous run
python -m tests.benchmark.run_benchmark --failed-from 20260119_073256

# Run questions 7-12 (offset 6, max 6)
python -m tests.benchmark.run_benchmark --offset 6 --max-questions 6

# Randomized test (3 random questions per scenario)
python -m tests.benchmark.run_benchmark --randomize --max-questions 3

# Specific question IDs
python -m tests.benchmark.run_benchmark --question-ids "VULN_EN_002,VULN_EN_004"

LLM Provider Selection

The benchmark runner supports multiple LLM providers for RAGAS evaluation via the --provider CLI argument.

Supported Providers

Provider  Flag                 Environment Variables                     Model
GigaChat  --provider gigachat  GIGACHAT_API_KEY or GIGACHAT_CREDENTIALS  GigaChat-2-Pro
Yandex    --provider yandex    YANDEX_API_KEY, YANDEX_FOLDER_ID          qwen3-235b-a22b-fp8/latest
OpenAI    --provider openai    OPENAI_API_KEY                            gpt-4
Local     --provider local     LOCAL_MODEL_PATH                          llama.cpp compatible

Usage Examples

# Run with Yandex provider (Qwen3 model)
python -m tests.benchmark.run_benchmark --provider yandex --ragas

# Run with GigaChat
python -m tests.benchmark.run_benchmark --provider gigachat --ragas

# Run with OpenAI
python -m tests.benchmark.run_benchmark --provider openai --ragas

# Quick run without RAGAS evaluation
python -m tests.benchmark.run_benchmark --provider yandex -q

Provider Configuration

Providers are configured in config.yaml:

llm:
  provider: yandex  # Default provider

  yandex:
    api_key: ${YANDEX_API_KEY}
    folder_id: ${YANDEX_FOLDER_ID}
    model: "qwen3-235b-a22b-fp8/latest"
    base_url: "https://llm.api.cloud.yandex.net/v1"
    timeout: 60

  gigachat:
    auth_key: ${GIGACHAT_AUTH_KEY}
    model: "GigaChat-2-Pro"

  openai:
    api_key: ${OPENAI_API_KEY}
    model: "gpt-4"

Provider-Specific Notes

Yandex Cloud AI Studio:
- Uses OpenAI-compatible API
- Default model: Qwen3 235B (high quality)
- Privacy compliant: data logging disabled by default
- Supports Russian language queries

GigaChat:
- Sber’s Russian LLM
- Best for Russian language content
- Requires certificate handling on some systems

Local Models:
- Uses llama.cpp via llama-cpp-python
- No API costs, fully offline
- Requires local GPU or sufficient CPU

Benchmark Dataset

The hybrid retrieval benchmark includes 11 queries across 4 types:

Semantic Queries (4 queries)

Focus on semantic understanding and documentation:
- “How does PostgreSQL handle transaction commits?”
- “What is the purpose of the buffer manager?”
- “How does PostgreSQL implement multi-version concurrency control?”
- “How does the query optimizer choose between index scan and sequential scan?” (mixed)

Expected Behavior:
- Vector-only: High performance (80-90% relevance)
- Graph-only: Low performance (20-30% relevance)
- Hybrid: Best performance (combines semantic understanding)

Structural Queries (4 queries)

Focus on graph traversal and dependencies:
- “Show me the call path from BeginTransactionBlock to CommitTransactionCommand”
- “Find all functions that call malloc”
- “What are the indirect callers of MemoryContextAlloc (depth 2-3)?”
- “Trace the execution path for a SELECT statement with WHERE clause” (mixed)

Expected Behavior:
- Vector-only: Low performance (20-30% relevance)
- Graph-only: High performance (80-90% relevance)
- Hybrid: Best performance (leverages graph structure)

Security Queries (3 queries)

Require both semantic patterns AND structural analysis:
- “Find potential SQL injection vulnerabilities in query building functions”
- “Identify functions that allocate memory without proper error checking”
- “Find buffer overflow risks in string manipulation functions”

Expected Behavior:
- Vector-only: Moderate (50-60% relevance)
- Graph-only: Moderate (50-60% relevance)
- Hybrid: Best performance (combines both)

Note: Queries marked (mixed) require both semantic understanding and structural traversal. They appear in the semantic/structural categories based on their primary query_type value.

Evaluation Modules

The benchmark framework includes 4 evaluation modules in tests/benchmark/evaluation/:

IR Metrics (ir_metrics.py)

Standard Information Retrieval metrics via the IRMetrics class:

Metric             Method                                              Description
Precision@K        precision_at_k(retrieved, relevant, k)              Fraction of relevant in top-K
Recall@K           recall_at_k(retrieved, relevant, k)                 Fraction of relevant documents found
F1@K               f1_at_k(retrieved, relevant, k)                     Harmonic mean of P and R
MRR                mrr(retrieved, relevant)                            Reciprocal rank of first relevant
NDCG@K             ndcg_at_k(retrieved, relevant, highly_relevant, k)  Graded relevance ranking
Average Precision  average_precision(retrieved, relevant)              Area under P-R curve
Hit Rate@K         hit_rate_at_k(retrieved, relevant, k)               Binary hit indicator

Default K values: [5, 10, 20]. Use compute_all() to compute all metrics at once.

Performance Metrics (performance_metrics.py)

Execution performance via the PerformanceMetrics class:
- Latency: min, max, mean, median, p50, p95, p99
- Token usage: input, output, total, per-question, per-second
- Cache: hit count, miss count, hit rate
- SQL complexity: weighted scoring (JOINs +2, subqueries +3, aggregations +1)
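The weighted SQL complexity score can be illustrated with a rough sketch. The regex-based counting below is an assumption about how the weights might be applied, not the module's actual code:

```python
import re

def sql_complexity(query: str) -> int:
    """Weighted SQL complexity: JOINs +2, subqueries +3, aggregations +1.

    Illustrative heuristic only: subqueries are approximated as SELECTs
    beyond the outermost one.
    """
    q = query.upper()
    joins = len(re.findall(r"\bJOIN\b", q))
    subqueries = max(len(re.findall(r"\bSELECT\b", q)) - 1, 0)
    aggregations = len(re.findall(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", q))
    return 2 * joins + 3 * subqueries + 1 * aggregations

score = sql_complexity(
    "SELECT COUNT(*) FROM calls c JOIN functions f ON c.callee = f.id "
    "WHERE f.id IN (SELECT id FROM functions WHERE name = 'malloc')"
)
# 1 JOIN (+2) + 1 subquery (+3) + 1 aggregation (+1) = 6
```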

Accuracy Metrics (accuracy_metrics.py)

Answer quality via the AccuracyMetrics class:
- Semantic similarity: cosine similarity via sentence-transformers (model: paraphrase-multilingual-MiniLM-L12-v2)
- Keyword coverage: keyword presence check with Russian grammatical case support
- Function coverage: precision/recall/F1 for retrieved vs expected functions
- Pattern matching: regex-based pattern validation
- Factual accuracy: composite accuracy score
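Semantic similarity here is cosine similarity over sentence-transformer embeddings. The core computation can be shown with plain vectors; this is a minimal sketch of the math, not the module's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Parallel vectors have similarity ~1.0 regardless of magnitude
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the real module the vectors would come from the sentence-transformers model named above, and the threshold from the question's semantic_similarity_threshold field.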

Multi-Entity Metrics (multi_entity_metrics.py)

Multi-entity evaluation via the MultiEntityIRMetrics class supporting 9 entity types:

Entity Type         Weight  Description
functions           1.0     Function definitions
external_functions  0.9     External/library functions
structs             0.9     Struct/class definitions
macros              0.8     Macro definitions
types               0.7     Type definitions
enums               0.7     Enum definitions
callers             0.85    Calling functions
callees             0.85    Called functions
files               0.95    Source files

Features: fuzzy function name matching (substring/prefix), cross-platform file path normalization, weighted metric combination across entity types.
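The fuzzy matching and weighted combination might look like the following minimal sketch. Both functions are illustrative, not the MultiEntityIRMetrics API:

```python
def fuzzy_match(retrieved: str, expected: str) -> bool:
    """Fuzzy function-name match: exact or substring/prefix, case-insensitive."""
    r, e = retrieved.lower(), expected.lower()
    return r == e or e in r or r in e

def weighted_f1(per_entity_f1: dict, weights: dict) -> float:
    """Combine per-entity-type F1 scores using per-type weights.

    Only entity types present in the results contribute; the weighted sum
    is normalized by the weights of those types.
    """
    present = {t: f1 for t, f1 in per_entity_f1.items() if t in weights}
    total = sum(weights[t] for t in present)
    return sum(weights[t] * f1 for t, f1 in present.items()) / total if total else 0.0

weights = {"functions": 1.0, "structs": 0.9, "files": 0.95}
combined = weighted_f1({"functions": 0.8, "files": 0.6}, weights)
# (1.0 * 0.8 + 0.95 * 0.6) / (1.0 + 0.95)
```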

Scenario-Based Testing

The comprehensive benchmark uses ground truth questions organized by scenario:

tests/benchmark/ground_truth/
+-- scenario_01_onboarding/
|   +-- questions_en.yaml
|   +-- questions_ru.yaml
|   +-- questions_en_codegraph.yaml  (project-specific)
+-- scenario_02_security_audit/
+-- ...
+-- scenario_20_dependencies/

20 scenario directories with EN + RU question sets. The benchmark configuration (benchmark_config.yaml) defines 16 core scenarios; scenarios 17–20 (file_editing, code_optimization, standards_check, dependencies) have ground truth but are not yet in the config.

Ground Truth YAML Format

scenario:
  id: "scenario_01_onboarding"
  name: "Codebase Onboarding"
  mapped_workflow: "onboarding_workflow"

metadata:
  version: "1.0"
  language: "en"
  question_count: 35
  difficulty_distribution:
    easy: 12
    medium: 15
    hard: 8

questions:
  - id: "ONBOARD_EN_001"
    question: "Where is heap_insert defined?"
    category: "definition_search"
    difficulty: "easy"
    postgresql_subsystem: "storage"
    target_function: "heap_insert"
    ground_truth:
      expected_functions: ["heap_insert"]
      expected_files: ["heapam.c"]
      required_keywords: ["heap", "insert", "tuple"]
      keyword_coverage_only: false
    evaluation:
      metrics: ["ir_metrics", "accuracy"]
      semantic_similarity_threshold: 0.7

Key fields in ground_truth:
- expected_functions, expected_callers, expected_callees — entity lists for IR metrics
- expected_structs, expected_macros, expected_types, expected_enums — additional entity types
- expected_files — expected source files
- required_keywords — keywords that must appear in the answer
- key_patterns — regex patterns for validation
- keyword_coverage_only: true — for conceptual questions without specific entities
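When writing ground truth by hand, a small sanity check over the parsed YAML catches the most common mistakes. A minimal sketch over an already-parsed question dict (validate_question is hypothetical, not part of the framework):

```python
def validate_question(q: dict) -> list:
    """Return a list of problems found in one parsed ground-truth question."""
    problems = []
    for field in ("id", "question", "difficulty", "ground_truth"):
        if field not in q:
            problems.append(f"missing field: {field}")
    gt = q.get("ground_truth", {})
    if not gt.get("keyword_coverage_only"):
        # Non-conceptual questions need at least one expected-entity list
        entity_keys = ("expected_functions", "expected_files", "expected_structs")
        if not any(gt.get(k) for k in entity_keys):
            problems.append("no expected entities and keyword_coverage_only is false")
    return problems

question = {
    "id": "ONBOARD_EN_001",
    "question": "Where is heap_insert defined?",
    "difficulty": "easy",
    "ground_truth": {"expected_functions": ["heap_insert"],
                     "expected_files": ["heapam.c"]},
}
issues = validate_question(question)  # empty list: the question is well-formed
```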

Configuration

Benchmark Config (tests/benchmark/config/benchmark_config.yaml)

benchmark:
  name: "CodeGraph Comprehensive Benchmark"
  version: "2.1"

execution:
  k_values: [5, 10, 20]
  max_parallel_questions: 1
  question_timeout: 60
  enable_tracing: true
  languages: ["en", "ru"]

thresholds:
  easy:
    precision_at_10: 0.3
    recall_at_10: 0.5
    mrr: 0.4
    semantic_similarity: 0.5
    keyword_coverage: 0.5
  medium:
    precision_at_10: 0.2
    recall_at_10: 0.3
    mrr: 0.3
  hard:
    precision_at_10: 0.1
    recall_at_10: 0.2
    mrr: 0.2

success_criteria:
  min_scenario_pass_rate: 0.5
  min_scenarios_passed: 8
  min_overall_pass_rate: 0.5
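A question's pass/fail status follows from comparing its metrics against the thresholds for its difficulty. A trimmed sketch of that check (the function name is hypothetical, and only three of the configured metrics are shown):

```python
# Subset of the thresholds from benchmark_config.yaml above
THRESHOLDS = {
    "easy":   {"precision_at_10": 0.3, "recall_at_10": 0.5, "mrr": 0.4},
    "medium": {"precision_at_10": 0.2, "recall_at_10": 0.3, "mrr": 0.3},
    "hard":   {"precision_at_10": 0.1, "recall_at_10": 0.2, "mrr": 0.2},
}

def question_passes(metrics: dict, difficulty: str) -> bool:
    """A question passes when every metric meets its difficulty threshold."""
    required = THRESHOLDS[difficulty]
    return all(metrics.get(name, 0.0) >= bar for name, bar in required.items())

result = {"precision_at_10": 0.25, "recall_at_10": 0.4, "mrr": 0.5}
ok = question_passes(result, "medium")   # passes the medium bars
```

Note how the same metrics can pass at one difficulty and fail at another: P@10 of 0.25 clears the medium threshold (0.2) but not the easy one (0.3).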

Additional Benchmarks

Intent Classification Benchmark

python -m tests.benchmark.intent_benchmark --language en --show-failures

Evaluates intent classifier accuracy across 17 scenarios. Supports --compare-languages for EN vs RU comparison.

Symbolic Execution Benchmark

python scripts/benchmark_symbolic_execution.py --db data/projects/codegraph.duckdb

Measures V2 symbolic execution impact: FP filtering rate, time/memory overhead, parser coverage.

LLM Disambiguation Benchmark

python scripts/benchmark_with_llm.py --llm --verbose

Tests intent classifier with LLM disambiguation enabled vs rule-based only.

Metrics Reference

Precision@K

P@K = (# relevant in top-K) / K
  • 1.0 = All top-K results are relevant
  • 0.0 = No relevant results in top-K

Recall@K

R@K = (# relevant in top-K) / (total # relevant)
  • 1.0 = All relevant documents retrieved
  • 0.0 = No relevant documents retrieved

F1@K

F1 = 2 * (P * R) / (P + R)

Harmonic mean of precision and recall — balanced measure of ranking quality.

Mean Reciprocal Rank (MRR)

MRR = 1 / (rank of first relevant result)
  • 1.0 = First result is relevant
  • 0.5 = Second result is relevant
  • 0.0 = No relevant results

Normalized Discounted Cumulative Gain (NDCG@K)

NDCG@K = DCG@K / IDCG@K
DCG@K = Σ (2^rel_i - 1) / log2(i + 1)

Graded relevance: highly relevant (rel=2) > relevant (rel=1) > not relevant (rel=0).
  • 1.0 = Perfect ranking
  • 0.0 = Worst ranking
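The formulas in this section can be sketched in a few lines of Python. This is a self-contained illustration of the definitions above, not the framework's IRMetrics class:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-K results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-K."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, highly_relevant, k):
    """NDCG@K with graded gains: rel=2 (highly relevant), 1 (relevant), 0."""
    def gain(d):
        return 2 if d in highly_relevant else (1 if d in relevant else 0)
    dcg = sum((2 ** gain(d) - 1) / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted((gain(d) for d in relevant | highly_relevant), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Hypothetical example: 4 results, 3 relevant docs exist, 1 highly relevant
retrieved = ["heap_insert", "heap_update", "heap_delete", "heap_fetch"]
relevant = {"heap_insert", "heap_delete", "heap_open"}
highly = {"heap_insert"}

p = precision_at_k(retrieved, relevant, 4)   # 2 of 4 are relevant -> 0.5
r = recall_at_k(retrieved, relevant, 4)      # 2 of 3 relevant found
rr = mrr(retrieved, relevant)                # first result is relevant -> 1.0
n = ndcg_at_k(retrieved, relevant, highly, 4)
```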

Custom Benchmark Queries

Create custom queries with ground truth for the hybrid retrieval benchmark:

from scripts.benchmark_hybrid_retrieval import BenchmarkQuery

custom_queries = [
    BenchmarkQuery(
        id="custom_001",
        query="Your question here",
        query_type="semantic",  # or "structural", "security"
        description="Description of what this tests",
        relevant_node_ids={1001, 1002, 1003, 1004},  # CPG node IDs
        highly_relevant_node_ids={1001, 1002},        # Most important nodes
        expected_difficulty="medium"  # "easy", "medium", "hard"
    ),
]

# Run with custom queries
import asyncio
report = asyncio.run(benchmark.run_benchmark(queries=custom_queries))

Output Files

Hybrid Retrieval Benchmark

Two files in benchmark_results/:

  1. JSON Report (hybrid_benchmark_<timestamp>.json) — complete machine-readable results: per-query metrics, aggregate metrics by mode, retrieved node IDs, score breakdowns.

  2. Markdown Report (hybrid_benchmark_<timestamp>.md) — human-readable summary: comparison table, key findings, improvement percentages.

Comprehensive Benchmark Runner

Results in tests/benchmark/results/{RUN_ID}/:

tests/benchmark/results/20260307_120000/
+-- summary.json          # Main results summary
+-- scenario_01.json      # Per-scenario breakdown
+-- scenario_02.json
+-- traces/
|   +-- ONBOARD_EN_001.trace
|   +-- VULN_EN_002.trace
+-- metadata.json         # Run metadata

Understanding Results

Interpreting Improvements

Positive improvements (+) are good:

Hybrid F1@10 vs Vector: +33.6%
= Hybrid retrieval is 33.6% better than vector-only

When Hybrid Outperforms:
- Semantic queries: Hybrid >= Vector > Graph
- Structural queries: Hybrid >= Graph > Vector
- Mixed queries: Hybrid > Vector, Graph

Example Analysis

Query: “How does PostgreSQL handle transaction commits?”
- Type: Semantic
- Results:
  - Vector: P@10=0.60, R@10=0.80, F1@10=0.69
  - Graph: P@10=0.20, R@10=0.30, F1@10=0.24
  - Hybrid: P@10=0.70, R@10=0.90, F1@10=0.79

Analysis:
- Vector performs well (semantic query)
- Graph struggles (not structural)
- Hybrid best: +14.5% over vector (RRF adds structural context)

Performance Considerations

Latency

Hybrid retrieval is slower than either single source because both backends are queried and their results merged; even with parallel execution, total latency is bounded by the slower source plus merge overhead. Actual latency depends on data size, hardware, and query complexity.

Trade-off: Hybrid sacrifices latency for 30-50% better relevance.

Caching

For production use:
1. Cache frequent queries
2. Pre-compute embeddings
3. Use connection pooling for DuckDB

Unit Tests

Run benchmark metrics tests:

pytest tests/unit/test_benchmark_metrics.py -v

Coverage (30 tests across 8 classes):
- TestPrecisionAtK: 5 tests
- TestRecallAtK: 4 tests
- TestF1Score: 4 tests
- TestMRR: 5 tests
- TestNDCG: 6 tests
- TestBenchmarkQuery: 2 tests
- TestRetrievalMetrics: 2 tests
- TestBenchmarkDataset: 3 tests

Additional test files:
- tests/unit/test_benchmark_ir_coverage.py — validates IR metrics coverage across all ground truth questions

Best Practices

1. Diverse Query Set

Include queries of different types (semantic, structural, security, mixed) and difficulties (easy, medium, hard).

2. Representative Ground Truth

Ensure ground truth reflects real-world relevance judgments:
- Mark highly relevant nodes explicitly
- Include partial matches in relevant set
- Use domain experts for validation

3. Multiple Metrics

Don’t rely on a single metric:
- F1@10: Overall ranking quality
- MRR: User experience (time to first relevant)
- NDCG@10: Graded relevance (highly relevant vs relevant)

4. Error Analysis

Examine per-query results to understand:
- Which query types benefit most from hybrid?
- Where does each mode fail?
- How to improve adaptive weighting?

5. Iterative Debugging

Use advanced filtering for efficient debugging:
- --failed-from RUN_ID to re-run only failures
- --offset N --max-questions M for batch testing
- --question-ids for specific questions

Troubleshooting

Error: “No module named ‘src.retrieval’”

Solution: Run from project root directory.

Error: “ModuleNotFoundError: VectorStore”

Solution: Ensure all dependencies installed:

pip install chromadb sentence-transformers duckdb

DuckDB Lock Error

Solution: Ensure gocpg.exe is not running (check with ps aux | grep gocpg).

Low Benchmark Scores

Solution: Check that:
1. ChromaDB has indexed documents (chroma_db/ directory exists)
2. DuckDB has CPG data (*.duckdb file is valid)
3. Domain is correctly set (domain.name in config.yaml)

Next Steps

  1. Run hybrid benchmark to understand retrieval quality
  2. Run comprehensive benchmark for full pipeline evaluation
  3. Examine output reports (JSON + Markdown)
  4. Customize benchmark queries or ground truth for your project
  5. Use --failed-from to iteratively fix failures
  6. Tune RRF weights based on findings

Support

For issues or questions:
- Create issue in GitHub repository
- Include benchmark configuration and error logs
- Provide sample query that fails