Version 3.0 — Comprehensive evaluation platform for hybrid retrieval and scenario-based testing.
Table of Contents¶
- Overview
- Quick Start
- Hybrid Retrieval Benchmark
- Comprehensive Benchmark Runner
- LLM Provider Selection
- Benchmark Dataset
- Evaluation Modules
- Scenario-Based Testing
- Configuration
- Additional Benchmarks
- Metrics Reference
- Custom Benchmark Queries
- Output Files
- Understanding Results
- Performance Considerations
- Unit Tests
- Best Practices
- Troubleshooting
- Next Steps
- Support
Overview¶
The benchmark framework provides multiple evaluation tools:
- Hybrid Retrieval Benchmark — compares vector-only, graph-only, and hybrid retrieval modes
- Comprehensive Benchmark Runner — scenario-based evaluation with 20 ground truth scenarios
- Intent Classification Benchmark — evaluates intent classifier accuracy
- Symbolic Execution Benchmark — measures V2 symbolic execution impact on taint analysis
- LLM Disambiguation Benchmark — tests intent classifier with LLM support
Quick Start¶
1. Run Hybrid Retrieval Benchmark¶
```bash
python scripts/benchmark_hybrid_retrieval.py --output-dir benchmark_results
```
Output:
```
================================================================================
BENCHMARK SUMMARY
================================================================================
Metric         Vector   Graph    Hybrid   vs Vector  vs Graph
--------------------------------------------------------------------------------
Precision@10   0.2182   0.2000   0.3000   +37.5%     +50.0%
Recall@10      0.4327   0.3543   0.5528   +27.8%     +56.0%
F1@10          0.2864   0.2510   0.3825   +33.6%     +52.4%
MRR            1.0000   0.6364   1.0000   +0.0%      +57.1%
NDCG@10        0.5304   0.4443   0.6590   +24.3%     +48.3%
```
2. Run Comprehensive Benchmark¶
```bash
# All scenarios, English questions
python -m tests.benchmark.run_benchmark --language en

# Quick mode (5 questions per scenario)
python -m tests.benchmark.run_benchmark -q --language en

# Specific scenarios
python -m tests.benchmark.run_benchmark --scenarios 01,02,03
```
3. Run With Real Data (Python API)¶
```python
import asyncio

from scripts.benchmark_hybrid_retrieval import HybridRetrievalBenchmark
from src.retrieval.vector_store_real import VectorStoreReal
from src.services.cpg_query_service import CPGQueryService

# Initialize stores
vector_store = VectorStoreReal(persist_directory="chroma_db")
cpg_service = CPGQueryService()

# Create and run benchmark
benchmark = HybridRetrievalBenchmark(
    vector_store=vector_store,
    cpg_service=cpg_service,
    output_dir="benchmark_results",
)

report = asyncio.run(benchmark.run_benchmark())
benchmark.save_report(report)
```
Hybrid Retrieval Benchmark¶
The hybrid retrieval benchmark (scripts/benchmark_hybrid_retrieval.py) compares three retrieval modes:
- Vector-only: Pure semantic search using ChromaDB
- Graph-only: Pure structural search using DuckDB/CPG
- Hybrid: RRF-merged combination of both
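The RRF merge step can be sketched in a few lines. This is a generic Reciprocal Rank Fusion implementation with the conventional `k=60` smoothing constant; the benchmark's actual merge may apply additional per-source weighting:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: score each document 1/(k + rank) in every
    input ranking, sum the scores, and sort by total (best first)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
vector_hits = ["commit_fn", "begin_fn", "abort_fn"]
graph_hits = ["begin_fn", "commit_fn", "lock_fn"]
merged = rrf_merge([vector_hits, graph_hits])
# Documents ranked highly by both sources float to the top
```

Because RRF only uses ranks, not raw scores, it merges cosine similarities and graph-traversal scores without any score normalization.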
Key classes:
- HybridRetrievalBenchmark(vector_store, cpg_service, output_dir="benchmark_results") — main orchestrator
- BenchmarkQuery — dataclass for a single query with ground truth
- RetrievalMetrics — dataclass for per-query metrics
- BenchmarkReport — aggregate report
CLI arguments:
python scripts/benchmark_hybrid_retrieval.py [OPTIONS]
| Argument | Default | Description |
|---|---|---|
| `--db-path` | active project | Path to DuckDB CPG database |
| `--chroma-path` | `chroma_db` | Path to ChromaDB vector store |
| `--output-dir` | `benchmark_results` | Output directory for results |
| `--modes` | `vector_only graph_only hybrid` | Retrieval modes to benchmark |
Comprehensive Benchmark Runner¶
The main benchmark tool (tests/benchmark/run_benchmark.py) runs scenario-based evaluations against the full CodeGraph pipeline.
python -m tests.benchmark.run_benchmark [OPTIONS]
CLI arguments:
| Argument | Short | Default | Description |
|---|---|---|---|
| `--project` | | None | Project name (auto-switches active project) |
| `--scenarios` | `-s` | all | Comma-separated scenario IDs (e.g., `01,02,03`) |
| `--language` | `-l` | all | Filter by language (en/ru) |
| `--difficulty` | `-d` | all | Filter: easy/medium/hard/expert |
| `--max-questions` | `-n` | unlimited | Maximum questions per scenario |
| `--quick` | `-q` | | Quick mode (5 questions per scenario) |
| `--mock` | `-m` | | Use mock copilot for infrastructure testing |
| `--trace` | `-t` | `true` | Enable traceability logging |
| `--no-trace` | | | Disable traceability logging |
| `--ragas` | `-r` | | Run RAGAS evaluation using LLM |
| `--provider` | `-p` | config | LLM provider (gigachat/yandex/openai/local) |
| `--failed-from` | | | Re-run failed questions from previous run ID |
| `--question-ids` | | | Specific question IDs (e.g., `VULN_EN_002,VULN_EN_004`) |
| `--offset` | | `0` | Skip first N questions per scenario |
| `--randomize` | | | Randomly select questions instead of sequential |
Examples:
```bash
# Re-run failed questions from previous run
python -m tests.benchmark.run_benchmark --failed-from 20260119_073256

# Run questions 7-12 (offset 6, max 6)
python -m tests.benchmark.run_benchmark --offset 6 --max-questions 6

# Randomized test (3 random questions per scenario)
python -m tests.benchmark.run_benchmark --randomize --max-questions 3

# Specific question IDs
python -m tests.benchmark.run_benchmark --question-ids "VULN_EN_002,VULN_EN_004"
```
LLM Provider Selection¶
The benchmark runner supports multiple LLM providers for RAGAS evaluation via the --provider CLI argument.
Supported Providers¶
| Provider | Flag | Environment Variables | Model |
|---|---|---|---|
| GigaChat | `--provider gigachat` | `GIGACHAT_API_KEY` or `GIGACHAT_CREDENTIALS` | GigaChat-2-Pro |
| Yandex | `--provider yandex` | `YANDEX_API_KEY`, `YANDEX_FOLDER_ID` | qwen3-235b-a22b-fp8/latest |
| OpenAI | `--provider openai` | `OPENAI_API_KEY` | gpt-4 |
| Local | `--provider local` | `LOCAL_MODEL_PATH` | llama.cpp compatible |
Usage Examples¶
```bash
# Run with Yandex provider (Qwen3 model)
python -m tests.benchmark.run_benchmark --provider yandex --ragas

# Run with GigaChat
python -m tests.benchmark.run_benchmark --provider gigachat --ragas

# Run with OpenAI
python -m tests.benchmark.run_benchmark --provider openai --ragas

# Quick run without RAGAS evaluation
python -m tests.benchmark.run_benchmark --provider yandex -q
```
Provider Configuration¶
Providers are configured in config.yaml:
```yaml
llm:
  provider: yandex  # Default provider

  yandex:
    api_key: ${YANDEX_API_KEY}
    folder_id: ${YANDEX_FOLDER_ID}
    model: "qwen3-235b-a22b-fp8/latest"
    base_url: "https://llm.api.cloud.yandex.net/v1"
    timeout: 60

  gigachat:
    auth_key: ${GIGACHAT_AUTH_KEY}
    model: "GigaChat-2-Pro"

  openai:
    api_key: ${OPENAI_API_KEY}
    model: "gpt-4"
```
Provider-Specific Notes¶
**Yandex Cloud AI Studio:**

- Uses OpenAI-compatible API
- Default model: Qwen3 235B (high quality)
- Privacy compliant: data logging disabled by default
- Supports Russian language queries

**GigaChat:**

- Sber’s Russian LLM
- Best for Russian language content
- Requires certificate handling on some systems

**Local Models:**

- Uses llama.cpp via llama-cpp-python
- No API costs, fully offline
- Requires local GPU or sufficient CPU
Benchmark Dataset¶
The hybrid retrieval benchmark includes 11 queries across 4 types:
Semantic Queries (4 queries)¶
Focus on semantic understanding and documentation:

- “How does PostgreSQL handle transaction commits?”
- “What is the purpose of the buffer manager?”
- “How does PostgreSQL implement multi-version concurrency control?”
- “How does the query optimizer choose between index scan and sequential scan?” (mixed)

Expected Behavior:

- Vector-only: high performance (80-90% relevance)
- Graph-only: low performance (20-30% relevance)
- Hybrid: best performance (combines semantic understanding)
Structural Queries (4 queries)¶
Focus on graph traversal and dependencies:

- “Show me the call path from BeginTransactionBlock to CommitTransactionCommand”
- “Find all functions that call malloc”
- “What are the indirect callers of MemoryContextAlloc (depth 2-3)?”
- “Trace the execution path for a SELECT statement with WHERE clause” (mixed)

Expected Behavior:

- Vector-only: low performance (20-30% relevance)
- Graph-only: high performance (80-90% relevance)
- Hybrid: best performance (leverages graph structure)
Security Queries (3 queries)¶
Require both semantic patterns AND structural analysis:

- “Find potential SQL injection vulnerabilities in query building functions”
- “Identify functions that allocate memory without proper error checking”
- “Find buffer overflow risks in string manipulation functions”

Expected Behavior:

- Vector-only: moderate (50-60% relevance)
- Graph-only: moderate (50-60% relevance)
- Hybrid: best performance (combines both)
Note: Queries marked (mixed) require both semantic understanding and structural traversal. They appear in the semantic/structural categories based on their primary `query_type` value.
Evaluation Modules¶
The benchmark framework includes 4 evaluation modules in tests/benchmark/evaluation/:
IR Metrics (ir_metrics.py)¶
Standard Information Retrieval metrics via the IRMetrics class:
| Metric | Method | Description |
|---|---|---|
| Precision@K | `precision_at_k(retrieved, relevant, k)` | Fraction of relevant in top-K |
| Recall@K | `recall_at_k(retrieved, relevant, k)` | Fraction of relevant documents found |
| F1@K | `f1_at_k(retrieved, relevant, k)` | Harmonic mean of P and R |
| MRR | `mrr(retrieved, relevant)` | Reciprocal rank of first relevant |
| NDCG@K | `ndcg_at_k(retrieved, relevant, highly_relevant, k)` | Graded relevance ranking |
| Average Precision | `average_precision(retrieved, relevant)` | Area under P-R curve |
| Hit Rate@K | `hit_rate_at_k(retrieved, relevant, k)` | Binary hit indicator |
Default K values: [5, 10, 20]. Use compute_all() to compute all metrics at once.
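The definitions behind these methods are compact. A minimal reimplementation of the first four is shown below; the real class lives in `ir_metrics.py` and its exact signatures may differ slightly:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-K retrieved items that are relevant."""
    if k == 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-K."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def f1_at_k(retrieved, relevant, k):
    """Harmonic mean of precision@K and recall@K."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```

For example, with `retrieved=[1, 2, 3, 4, 5]` and `relevant={2, 5, 9}`, precision@5 is 0.4, recall@5 is 2/3, and MRR is 0.5 (first hit at rank 2).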
Performance Metrics (performance_metrics.py)¶
Execution performance via the PerformanceMetrics class:
- Latency: min, max, mean, median, p50, p95, p99
- Token usage: input, output, total, per-question, per-second
- Cache: hit count, miss count, hit rate
- SQL complexity: weighted scoring (JOINs +2, subqueries +3, aggregations +1)
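A plausible implementation of the weighted SQL complexity score is sketched below; the regex-based feature counting is an assumption, and the actual scorer in `performance_metrics.py` may detect these constructs differently, but the weights match the ones listed above:

```python
import re

def sql_complexity(sql: str) -> int:
    """Weighted SQL complexity: JOINs +2, subqueries +3, aggregations +1.
    Feature detection here is deliberately crude (keyword regexes)."""
    s = sql.upper()
    joins = len(re.findall(r"\bJOIN\b", s))
    subqueries = len(re.findall(r"\(\s*SELECT\b", s))
    aggregations = len(re.findall(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s))
    return joins * 2 + subqueries * 3 + aggregations * 1
```

A query with one JOIN and one COUNT scores 3; adding a subquery pushes it to 6.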
Accuracy Metrics (accuracy_metrics.py)¶
Answer quality via the AccuracyMetrics class:
- Semantic similarity: cosine similarity via sentence-transformers (model: paraphrase-multilingual-MiniLM-L12-v2)
- Keyword coverage: keyword presence check with Russian grammatical case support
- Function coverage: precision/recall/F1 for retrieved vs expected functions
- Pattern matching: regex-based pattern validation
- Factual accuracy: composite accuracy score
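Keyword coverage is the simplest of these checks; a minimal sketch is below. Note that the real module also handles Russian grammatical cases, which this plain substring check does not attempt:

```python
def keyword_coverage(answer: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the answer
    (simple case-insensitive substring check)."""
    if not required_keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)
```

A question with `required_keywords: ["heap", "insert", "tuple"]` scores 1.0 only when the answer mentions all three terms.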
Multi-Entity Metrics (multi_entity_metrics.py)¶
Multi-entity evaluation via the MultiEntityIRMetrics class supporting 9 entity types:
| Entity Type | Weight | Description |
|---|---|---|
| `functions` | 1.0 | Function definitions |
| `external_functions` | 0.9 | External/library functions |
| `structs` | 0.9 | Struct/class definitions |
| `macros` | 0.8 | Macro definitions |
| `types` | 0.7 | Type definitions |
| `enums` | 0.7 | Enum definitions |
| `callers` | 0.85 | Calling functions |
| `callees` | 0.85 | Called functions |
| `files` | 0.95 | Source files |
Features: fuzzy function name matching (substring/prefix), cross-platform file path normalization, weighted metric combination across entity types.
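One plausible way to combine per-entity-type scores with the weights from the table is a weighted average over the entity types present in a question; the actual `MultiEntityIRMetrics` combination logic may differ:

```python
# Weights from the table above
ENTITY_WEIGHTS = {
    "functions": 1.0, "files": 0.95, "external_functions": 0.9,
    "structs": 0.9, "callers": 0.85, "callees": 0.85,
    "macros": 0.8, "types": 0.7, "enums": 0.7,
}

def combine_entity_scores(per_entity_scores: dict) -> float:
    """Weighted average over the entity types present in a question,
    so questions without (say) enums are not penalized for them."""
    present = {e: s for e, s in per_entity_scores.items() if e in ENTITY_WEIGHTS}
    total_weight = sum(ENTITY_WEIGHTS[e] for e in present)
    if total_weight == 0:
        return 0.0
    return sum(ENTITY_WEIGHTS[e] * s for e, s in present.items()) / total_weight
```

Normalizing by the weights actually present means a perfect `functions` match with a missed `files` match scores 1.0/1.95 ≈ 0.51, not 0.5: the heavier-weighted entity type dominates.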
Scenario-Based Testing¶
The comprehensive benchmark uses ground truth questions organized by scenario:
```
tests/benchmark/ground_truth/
+-- scenario_01_onboarding/
|   +-- questions_en.yaml
|   +-- questions_ru.yaml
|   +-- questions_en_codegraph.yaml   (project-specific)
+-- scenario_02_security_audit/
+-- ...
+-- scenario_20_dependencies/
```
20 scenario directories with EN + RU question sets. The benchmark configuration (benchmark_config.yaml) defines 16 core scenarios; scenarios 17–20 (file_editing, code_optimization, standards_check, dependencies) have ground truth but are not yet in the config.
Ground Truth YAML Format¶
```yaml
scenario:
  id: "scenario_01_onboarding"
  name: "Codebase Onboarding"
  mapped_workflow: "onboarding_workflow"

metadata:
  version: "1.0"
  language: "en"
  question_count: 35
  difficulty_distribution:
    easy: 12
    medium: 15
    hard: 8

questions:
  - id: "ONBOARD_EN_001"
    question: "Where is heap_insert defined?"
    category: "definition_search"
    difficulty: "easy"
    postgresql_subsystem: "storage"
    target_function: "heap_insert"
    ground_truth:
      expected_functions: ["heap_insert"]
      expected_files: ["heapam.c"]
      required_keywords: ["heap", "insert", "tuple"]
      keyword_coverage_only: false
    evaluation:
      metrics: ["ir_metrics", "accuracy"]
      semantic_similarity_threshold: 0.7
```
Key fields in ground_truth:
- expected_functions, expected_callers, expected_callees — entity lists for IR metrics
- expected_structs, expected_macros, expected_types, expected_enums — additional entity types
- expected_files — expected source files
- required_keywords — keywords that must appear in the answer
- key_patterns — regex patterns for validation
- keyword_coverage_only: true — for conceptual questions without specific entities
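When authoring new ground truth, a small lint pass over the parsed questions (load the YAML with `yaml.safe_load` first) catches questions that no metric can score. The specific consistency rules below are illustrative assumptions, not checks the framework is known to perform:

```python
def lint_question(q: dict) -> list[str]:
    """Flag ground-truth questions that no metric can score."""
    problems = []
    gt = q.get("ground_truth", {})
    entity_keys = ("expected_functions", "expected_files", "expected_callers",
                   "expected_callees", "expected_structs", "expected_macros",
                   "expected_types", "expected_enums")
    has_entities = any(gt.get(k) for k in entity_keys)
    if not has_entities and not gt.get("keyword_coverage_only"):
        problems.append(f"{q.get('id', '?')}: no expected entities and "
                        "keyword_coverage_only is not set")
    if gt.get("keyword_coverage_only") and not gt.get("required_keywords"):
        problems.append(f"{q.get('id', '?')}: keyword_coverage_only without "
                        "required_keywords")
    return problems

# A well-formed question (same shape as the YAML example above)
question = {
    "id": "ONBOARD_EN_001",
    "ground_truth": {
        "expected_functions": ["heap_insert"],
        "expected_files": ["heapam.c"],
        "required_keywords": ["heap", "insert", "tuple"],
        "keyword_coverage_only": False,
    },
}
```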
Configuration¶
Benchmark Config (tests/benchmark/config/benchmark_config.yaml)¶
```yaml
benchmark:
  name: "CodeGraph Comprehensive Benchmark"
  version: "2.1"

execution:
  k_values: [5, 10, 20]
  max_parallel_questions: 1
  question_timeout: 60
  enable_tracing: true
  languages: ["en", "ru"]

thresholds:
  easy:
    precision_at_10: 0.3
    recall_at_10: 0.5
    mrr: 0.4
    semantic_similarity: 0.5
    keyword_coverage: 0.5
  medium:
    precision_at_10: 0.2
    recall_at_10: 0.3
    mrr: 0.3
  hard:
    precision_at_10: 0.1
    recall_at_10: 0.2
    mrr: 0.2

success_criteria:
  min_scenario_pass_rate: 0.5
  min_scenarios_passed: 8
  min_overall_pass_rate: 0.5
```
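One way to read these per-difficulty thresholds is as floors that every listed metric must meet; whether the runner requires all floors or only a subset is an assumption in this sketch:

```python
# Per-difficulty metric floors, mirroring benchmark_config.yaml
THRESHOLDS = {
    "easy": {"precision_at_10": 0.3, "recall_at_10": 0.5, "mrr": 0.4,
             "semantic_similarity": 0.5, "keyword_coverage": 0.5},
    "medium": {"precision_at_10": 0.2, "recall_at_10": 0.3, "mrr": 0.3},
    "hard": {"precision_at_10": 0.1, "recall_at_10": 0.2, "mrr": 0.2},
}

def question_passes(metrics: dict, difficulty: str) -> bool:
    """A question passes when every thresholded metric meets its floor.
    Missing metrics default to 0.0 and therefore fail."""
    return all(metrics.get(name, 0.0) >= floor
               for name, floor in THRESHOLDS[difficulty].items())
```

Note how the floors relax with difficulty: a P@10 of 0.25 fails an easy question but passes a medium one.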
Additional Benchmarks¶
Intent Classification Benchmark¶
python -m tests.benchmark.intent_benchmark --language en --show-failures
Evaluates intent classifier accuracy across 17 scenarios. Supports --compare-languages for EN vs RU comparison.
Symbolic Execution Benchmark¶
python scripts/benchmark_symbolic_execution.py --db data/projects/codegraph.duckdb
Measures V2 symbolic execution impact: FP filtering rate, time/memory overhead, parser coverage.
LLM Disambiguation Benchmark¶
python scripts/benchmark_with_llm.py --llm --verbose
Tests intent classifier with LLM disambiguation enabled vs rule-based only.
Metrics Reference¶
Precision@K¶
P@K = (# relevant in top-K) / K
- 1.0 = All top-K results are relevant
- 0.0 = No relevant results in top-K
Recall@K¶
R@K = (# relevant in top-K) / (total # relevant)
- 1.0 = All relevant documents retrieved
- 0.0 = No relevant documents retrieved
F1@K¶
F1 = 2 * (P * R) / (P + R)
Harmonic mean of precision and recall — balanced measure of ranking quality.
Mean Reciprocal Rank (MRR)¶
MRR = 1 / (rank of first relevant result)
- 1.0 = First result is relevant
- 0.5 = Second result is relevant
- 0.0 = No relevant results
Normalized Discounted Cumulative Gain (NDCG@K)¶
NDCG@K = DCG@K / IDCG@K
DCG@K = Σ (2^rel_i - 1) / log2(i + 1)
Graded relevance: highly relevant (rel=2) > relevant (rel=1) > not relevant (rel=0).

- 1.0 = Perfect ranking
- 0.0 = Worst ranking
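The NDCG formula above translates directly to code. One subtlety is the ideal ranking: whether `relevant` includes the highly relevant set varies between implementations, so this sketch normalizes defensively with a set difference:

```python
import math

def dcg(gains, k):
    """DCG@K over graded gains; position 1 divides by log2(2) = 1."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(retrieved, relevant, highly_relevant, k):
    """NDCG@K with gains 2 / 1 / 0 for highly relevant / relevant / other."""
    gains = [2 if d in highly_relevant else 1 if d in relevant else 0
             for d in retrieved]
    # Ideal ranking: every highly relevant doc first, then the rest
    ideal = [2] * len(highly_relevant) + [1] * len(set(relevant) - set(highly_relevant))
    idcg = dcg(ideal, k)
    return dcg(gains, k) / idcg if idcg > 0 else 0.0
```

Retrieving the highly relevant document first yields 1.0; burying it behind irrelevant results discounts its gain logarithmically.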
Custom Benchmark Queries¶
Create custom queries with ground truth for the hybrid retrieval benchmark:
```python
import asyncio

from scripts.benchmark_hybrid_retrieval import BenchmarkQuery

custom_queries = [
    BenchmarkQuery(
        id="custom_001",
        query="Your question here",
        query_type="semantic",  # or "structural", "security"
        description="Description of what this tests",
        relevant_node_ids={1001, 1002, 1003, 1004},  # CPG node IDs
        highly_relevant_node_ids={1001, 1002},       # Most important nodes
        expected_difficulty="medium",  # "easy", "medium", "hard"
    ),
]

# Run with custom queries
report = asyncio.run(benchmark.run_benchmark(queries=custom_queries))
```
Output Files¶
Hybrid Retrieval Benchmark¶
Two files in benchmark_results/:
- **JSON Report** (`hybrid_benchmark_<timestamp>.json`) — complete machine-readable results: per-query metrics, aggregate metrics by mode, retrieved node IDs, score breakdowns.
- **Markdown Report** (`hybrid_benchmark_<timestamp>.md`) — human-readable summary: comparison table, key findings, improvement percentages.
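Because the filenames embed a sortable timestamp, picking up the latest JSON report for post-processing is a one-liner; the report's internal field names are not specified here, so this sketch only handles file selection:

```python
import json
from pathlib import Path

def latest_report(results_dir="benchmark_results"):
    """Load the newest hybrid benchmark JSON report. Filenames embed a
    sortable timestamp, so lexicographic order is chronological."""
    reports = sorted(Path(results_dir).glob("hybrid_benchmark_*.json"))
    if not reports:
        return None
    return json.loads(reports[-1].read_text())
```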
Comprehensive Benchmark Runner¶
Results in tests/benchmark/results/{RUN_ID}/:
```
tests/benchmark/results/20260307_120000/
+-- summary.json          # Main results summary
+-- scenario_01.json      # Per-scenario breakdown
+-- scenario_02.json
+-- traces/
|   +-- ONBOARD_EN_001.trace
|   +-- VULN_EN_002.trace
+-- metadata.json         # Run metadata
```
Understanding Results¶
Interpreting Improvements¶
Positive improvements (+) are good:
Hybrid F1@10 vs Vector: +33.6%
= Hybrid retrieval is 33.6% better than vector-only
When Hybrid Outperforms:

- Semantic queries: Hybrid >= Vector > Graph
- Structural queries: Hybrid >= Graph > Vector
- Mixed queries: Hybrid > Vector, Graph
Example Analysis¶
Query: “How does PostgreSQL handle transaction commits?”

- Type: Semantic
- Results:
  - Vector: P@10=0.60, R@10=0.80, F1@10=0.69
  - Graph: P@10=0.20, R@10=0.30, F1@10=0.24
  - Hybrid: P@10=0.70, R@10=0.90, F1@10=0.79

Analysis:

- Vector performs well (semantic query)
- Graph struggles (not structural)
- Hybrid best: +14.5% over vector (RRF adds structural context)
Performance Considerations¶
Latency¶
Hybrid retrieval is slower than single-source due to parallel execution overhead. Actual latency depends on data size, hardware, and query complexity.
Trade-off: Hybrid sacrifices latency for 30-50% better relevance.
Caching¶
For production use:

1. Cache frequent queries
2. Pre-compute embeddings
3. Use connection pooling for DuckDB
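The first two recommendations can be combined: memoize the embedding call so repeated queries never re-encode. This sketch wraps a hypothetical `embed_query` stand-in (here deterministic and fake, so the example is self-contained) with the standard library's `lru_cache`:

```python
import hashlib
from functools import lru_cache

def embed_query(query: str) -> list[float]:
    """Stand-in for a real embedding call (hypothetical).
    Deterministic hash-derived floats keep the demo offline."""
    digest = hashlib.sha256(query.encode()).digest()
    return [b / 255.0 for b in digest[:4]]

@lru_cache(maxsize=1024)
def cached_embed(query: str) -> tuple:
    """Return a hashable (tuple) embedding; repeat queries hit the cache."""
    return tuple(embed_query(query))
```

`cached_embed.cache_info()` exposes hit/miss counts, which maps directly onto the cache hit-rate metric tracked by `performance_metrics.py`.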
Unit Tests¶
Run benchmark metrics tests:
pytest tests/unit/test_benchmark_metrics.py -v
Coverage (30 tests across 8 classes):

- TestPrecisionAtK: 5 tests
- TestRecallAtK: 4 tests
- TestF1Score: 4 tests
- TestMRR: 5 tests
- TestNDCG: 6 tests
- TestBenchmarkQuery: 2 tests
- TestRetrievalMetrics: 2 tests
- TestBenchmarkDataset: 3 tests
Additional test files:
- tests/unit/test_benchmark_ir_coverage.py — validates IR metrics coverage across all ground truth questions
Best Practices¶
1. Diverse Query Set¶
Include queries of different types (semantic, structural, security, mixed) and difficulties (easy, medium, hard).
2. Representative Ground Truth¶
Ensure ground truth reflects real-world relevance judgments:

- Mark highly relevant nodes explicitly
- Include partial matches in the relevant set
- Use domain experts for validation
3. Multiple Metrics¶
Don’t rely on a single metric:

- F1@10: overall ranking quality
- MRR: user experience (time to first relevant result)
- NDCG@10: graded relevance (highly relevant vs relevant)
4. Error Analysis¶
Examine per-query results to understand:

- Which query types benefit most from hybrid?
- Where does each mode fail?
- How can adaptive weighting be improved?
5. Iterative Debugging¶
Use advanced filtering for efficient debugging:
- --failed-from RUN_ID to re-run only failures
- --offset N --max-questions M for batch testing
- --question-ids for specific questions
Troubleshooting¶
Error: “No module named ‘src.retrieval’”¶
Solution: Run from project root directory.
Error: “ModuleNotFoundError: VectorStore”¶
Solution: Ensure all dependencies installed:
pip install chromadb sentence-transformers duckdb
DuckDB Lock Error¶
Solution: Ensure the `gocpg` process is not running (`ps aux | grep gocpg` on Unix; look for `gocpg.exe` in Task Manager on Windows).
Low Benchmark Scores¶
Solution: Check that:
1. ChromaDB has indexed documents (chroma_db/ directory exists)
2. DuckDB has CPG data (*.duckdb file is valid)
3. Domain is correctly set (config.yaml → domain.name)
Next Steps¶
- Run hybrid benchmark to understand retrieval quality
- Run comprehensive benchmark for full pipeline evaluation
- Examine output reports (JSON + Markdown)
- Customize benchmark queries or ground truth for your project
- Use `--failed-from` to iteratively fix failures
- Tune RRF weights based on findings
Support¶
For issues or questions:

- Create an issue in the GitHub repository
- Include benchmark configuration and error logs
- Provide a sample query that fails