A new developer joins the team and needs to understand the codebase structure quickly using CPG-powered analysis.
Table of Contents
- Quick Start
- How It Works
  - Query Type Detection
  - Handler Architecture
  - Retrieval Strategy
  - Enrichment Pipeline
- Use Cases
  - Day 1: Codebase Overview
  - Finding Definitions
  - Understanding Call Graphs
  - Tracing Data Flow
  - External Context
- Handler Categories
  - Core Handlers
  - Cross-Scenario Adapters
  - Security & Refactoring Handlers
- Configuration
- CLI Usage
- Example Questions
- Related Scenarios
Quick Start
# Select Onboarding Scenario
/select 01
How It Works
Query Type Detection
When you ask a question, OnboardingQueryDetector classifies it using 4 mixin-based detectors in a 5-phase pipeline:
| Detector Mixin | Methods | Detects |
|---|---|---|
| DefinitionDetectorMixin | 4 | Function location, signature, file path queries |
| CallGraphDetectorMixin | 11 | Callers, callees, paths, cycles, centrality, SCC/WCC |
| DataflowDetectorMixin | 4 | Data flow, taint propagation, privilege boundaries |
| SubsystemDetectorMixin | 14 | Subsystem overview, debug, business logic, external context |
Detection runs in 5 phases:
Query
|
v
[Phase 1] Early exit — parenthesized function definitions
|
v
[Phase 2] Complex structural checks (17 rules)
|
v
[Phase 3] Lifecycle trace & execution trace
|
v
[Phase 4] Dispatch table — priority-ordered detection rules
|
v
[Phase 5] Fallback chain
|
v
DetectionResult(type, target, target_variants, variable,
module_filter, file_path_filter, directory_filter)
Detection supports both English and Russian queries, with morphological keyword matching via keyword_match_morphological().
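The phased flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's OnboardingQueryDetector: the regular expressions, the three rules, and the `detect` function are hypothetical, Phases 2 and 3 are omitted, and only the DetectionResult field names are taken from the diagram.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DetectionResult:
    # Field names follow the diagram above; types and defaults are assumptions.
    type: str
    target: Optional[str] = None
    target_variants: list = field(default_factory=list)
    variable: Optional[str] = None
    module_filter: Optional[str] = None
    file_path_filter: Optional[str] = None
    directory_filter: Optional[str] = None


# Hypothetical Phase 1 pattern: an identifier immediately followed by "()".
PAREN_CALL = re.compile(r"\b([A-Za-z_]\w*)\(\)")

# Hypothetical dispatch table: (priority, query type, pattern with one capture group).
# Lower priority numbers are checked first.
DISPATCH_RULES = [
    (10, "callers", re.compile(r"callers of (\w+)", re.IGNORECASE)),
    (20, "dataflow", re.compile(r"data flow from (\w+)", re.IGNORECASE)),
    (30, "definition", re.compile(r"where is (\w+) defined", re.IGNORECASE)),
]


def detect(query: str) -> DetectionResult:
    # Early exit: a "name()" mention is treated as a definition query.
    match = PAREN_CALL.search(query)
    if match:
        return DetectionResult(type="definition", target=match.group(1))

    # Dispatch table: the first matching rule in priority order wins.
    for _priority, query_type, pattern in sorted(DISPATCH_RULES, key=lambda r: r[0]):
        match = pattern.search(query)
        if match:
            return DetectionResult(type=query_type, target=match.group(1))

    # Fallback: nothing matched, so classify as a general query.
    return DetectionResult(type="general")
```

With these toy rules, `detect("Show me all callers of palloc")` yields a result of type `callers` with target `palloc`, and an unrecognized query falls through to type `general`.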
Handler Architecture
The onboarding scenario uses a handler-based architecture: OnboardingHandlerRegistry holds 65 registered handlers across 7 categories:
DetectionResult
|
v
OnboardingHandlerRegistry → find handler by query type
|
v
OnboardingHandler.handle(query_info)
|
v
OnboardingResult
|-- should_return=True → return formatted answer (+ optional enrichment)
|-- should_return=False → gather data, fall through to LLM
|
v
Post-processing safety net (catches missed queries)
|
v
LLM response generation (if no handler matched)
Each handler inherits from OnboardingHandler and returns an OnboardingResult with 8 fields:
| Field | Type | Description |
|---|---|---|
| cpg_results | list | CPG query results |
| retrieved_functions | list | Function names for context |
| answer | str | Formatted response |
| evidence | list | Evidence strings |
| handled_flag | str | State flag to prevent re-handling |
| should_return | bool | Return immediately or continue to LLM |
| graph_context | dict | Additional graph data |
| skip_enrichment | bool | Opt-out of LLM enrichment |
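To make the result contract concrete, here is a minimal sketch of how a registry lookup and the `should_return` decision could fit together. The field names and types come from the table above; the defaults, the `Registry` and `DefinitionHandler` classes, the `answer` function, and the `llm` callable are assumptions made for illustration and are not the project's actual classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OnboardingResult:
    # Field names and types follow the table above; defaults are assumptions.
    cpg_results: list = field(default_factory=list)
    retrieved_functions: list = field(default_factory=list)
    answer: str = ""
    evidence: list = field(default_factory=list)
    handled_flag: str = ""
    should_return: bool = False
    graph_context: dict = field(default_factory=dict)
    skip_enrichment: bool = False


class OnboardingHandler:
    """Simplified base class: one query type, one handle() method."""
    query_type = ""

    def handle(self, query_info: dict) -> OnboardingResult:
        raise NotImplementedError


class DefinitionHandler(OnboardingHandler):
    # Hypothetical handler that answers from a hard-coded lookup table.
    query_type = "definition"
    LOCATIONS = {"palloc": "src/backend/utils/mmgr/mcxt.c:1089"}

    def handle(self, query_info):
        target = query_info.get("target")
        location = self.LOCATIONS.get(target)
        if location is None:
            # Nothing found: should_return stays False so the LLM path runs.
            return OnboardingResult(handled_flag="definition_handled")
        return OnboardingResult(
            cpg_results=[{"name": target, "location": location}],
            retrieved_functions=[target],
            answer=f"{target} is defined in: {location}",
            evidence=[f"CPG: {target} at {location}"],
            handled_flag="definition_handled",
            should_return=True,
        )


class Registry:
    """Maps a query type to its handler (stand-in for OnboardingHandlerRegistry)."""

    def __init__(self):
        self._handlers = {}

    def register(self, handler):
        self._handlers[handler.query_type] = handler

    def find(self, query_type) -> Optional[OnboardingHandler]:
        return self._handlers.get(query_type)


def answer(registry, query_info, llm):
    handler = registry.find(query_info["type"])
    if handler is not None:
        result = handler.handle(query_info)
        if result.should_return:
            return result.answer
    # No handler matched, or the handler chose to fall through.
    return llm(query_info)
```

A handler that finds nothing returns `should_return=False`, so the caller falls through to the LLM rather than returning an empty answer.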
Retrieval Strategy
The base OnboardingHandler provides 6 specialized search methods for multi-source retrieval:
| Method | Source | Purpose |
|---|---|---|
| _vector_search_fallback() | ChromaDB code_comments | Semantic search in docstrings when CPG returns nothing |
| _snippet_search() | ChromaDB code_snippets | Search in function implementations |
| _domain_pattern_search() | ChromaDB domain_patterns | Find security/taint patterns |
| _vector_supplement() | All sources | 4-phase additive merge (see below) |
| _get_pre_retrieval_names() | state["pre_retrieval_results"] | Integrate HybridRetriever results |
| enrich_retrieved_functions() | CPG | Add entity types (function/class/interface) |
The _vector_supplement() method performs a 4-phase additive merge to maximize recall:
Phase 1: CPG exact matches (highest priority)
Phase 2: Pre-retrieval hybrid results (from HybridRetriever)
Phase 3: Vector search additions (semantic similarity)
Phase 4: Code snippet additions (implementation search)
→ Deduplicated merged list
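The four phases above amount to an order-preserving union: earlier sources keep their position and later sources only add names that are not already present. A minimal sketch follows; the function name `additive_merge`, its parameters, and the default `limit` of 25 (borrowed from the `retrieved_functions` limit in the Configuration section) are assumptions, not the signature of `_vector_supplement()`.

```python
def additive_merge(cpg_exact, pre_retrieval, vector_hits, snippet_hits, limit=25):
    """Merge four lists of function names in priority order, dropping duplicates."""
    merged = []
    seen = set()
    # Phases are visited from highest to lowest priority.
    for phase in (cpg_exact, pre_retrieval, vector_hits, snippet_hits):
        for name in phase:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    # Truncate only after merging so higher-priority names survive the cut.
    return merged[:limit]


names = additive_merge(
    ["palloc"],             # Phase 1: CPG exact matches
    ["palloc", "palloc0"],  # Phase 2: pre-retrieval hybrid results
    ["pfree", "palloc0"],   # Phase 3: vector search additions
    ["repalloc"],           # Phase 4: code snippet additions
)
print(names)  # ['palloc', 'palloc0', 'pfree', 'repalloc']
```

Because the merge never removes or reorders earlier entries, a weak vector hit cannot displace an exact CPG match.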
Enrichment Pipeline
When a handler returns should_return=True and enrichment is enabled, a 3-phase enrichment pipeline runs:
OnboardingResult (from handler)
|
v
[Phase 1] CPG Comments
|-- Batch-fetch function descriptions from code comments
|-- Merge descriptions into cpg_results
|-- Reformat answer with descriptions
|
v
[Phase 2] Vector Context (parallel, 4 workers)
|-- Q&A pairs (retrieve_qa)
|-- SQL examples (retrieve_sql)
|-- Generated docs (retrieve_generated_docs)
|-- Code comments (retrieve_comments)
|-- Append to evidence
|
v
[Phase 3] LLM Synthesis (optional)
|-- Build prompt with CPG + vector context
|-- Generate enriched answer (cached by query hash)
|-- Preserve original as _original_answer
Handlers can set skip_enrichment=True to opt out (e.g., for structured tabular responses).
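Phases 2 and 3 can be sketched with the standard library alone. This is an illustration under assumptions: the retrievers are modeled as plain callables taking `(query, top_k)` and returning a list of strings, which may not match the real signatures of `retrieve_qa` and its siblings, and `gather_vector_context`, `synthesize`, and the in-memory cache are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def gather_vector_context(query, retrievers, top_k=3):
    """Run every retriever in parallel and collect evidence in a stable order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, query, top_k) for fn in retrievers]
        evidence = []
        # Iterate in submission order, not completion order, for determinism.
        for future in futures:
            evidence.extend(future.result())
    return evidence


# In-memory cache keyed by a hash of the query text (an assumption for this sketch).
_synthesis_cache = {}


def synthesize(query, answer, evidence, llm):
    """Call the LLM once per distinct query and keep the original answer."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    if key not in _synthesis_cache:
        prompt = f"Question: {query}\nAnswer: {answer}\nContext:\n" + "\n".join(evidence)
        _synthesis_cache[key] = llm(prompt)
    return {"answer": _synthesis_cache[key], "_original_answer": answer}
```

Keeping the original answer next to the enriched one makes it possible to fall back or compare when the synthesis step produces something unhelpful.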
Use Cases
Day 1: Codebase Overview
Ask high-level questions about the system:
> What is the executor subsystem?
╭─────────────── Answer ────────────────╮
│ The executor subsystem is responsible │
│ for executing query plans generated │
│ by the planner. │
│ │
│ Key components: │
│ - ExecutorStart: Initialize state │
│ - ExecutorRun: Main execution loop │
│ - ExecutorEnd: Cleanup resources │
│ │
│ Entry point: src/backend/executor/ │
│ execMain.c │
╰───────────────────────────────────────╯
> What are the main entry points in the executor?
> Show me the architecture of query execution
Finding Definitions
The DefinitionOnboardingHandler locates function implementations with signatures and related functions, formatted by DefinitionFormatter:
> Where is palloc defined?
╭─────────────── Answer ────────────────╮
│ palloc is defined in: │
│ src/backend/utils/mmgr/mcxt.c:1089 │
│ │
│ Signature: │
│ void *palloc(Size size) │
│ │
│ Related functions: │
│ palloc0(), palloc_extended(), │
│ repalloc(), pfree() │
╰───────────────────────────────────────╯
Understanding Call Graphs
The CallGraphOnboardingHandler family (8 handlers) provides callers, callees, shortest paths, cycle detection, SCC/WCC analysis, and centrality metrics, formatted by CallGraphFormatter:
> Show me all callers of palloc
╭─────────────── Callers ───────────────╮
│ 1. heap_form_tuple() │
│ 2. ExecStoreTuple() │
│ 3. construct_array() │
│ 4. pnstrdup() │
│ 5. SPI_connect() │
│ ... (showing top 5 of 2,847 callers) │
╰───────────────────────────────────────╯
> What functions does LWLockAcquire call?
> Find shortest call path from main to heap_insert
> Detect cycles in the call graph involving executor
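For readers new to call-graph queries, the caller lookup and shortest-path questions above reduce to simple graph operations. The sketch below uses a tiny invented graph and breadth-first search; the edges, the function names in `CALLS`, and both helper functions are made up for illustration and say nothing about how the handlers query the CPG.

```python
from collections import deque

# Toy caller -> callees edges, invented for illustration only.
CALLS = {
    "main": ["parse", "execute"],
    "parse": ["alloc"],
    "execute": ["insert_row", "alloc"],
    "insert_row": ["alloc"],
}


def callers_of(target, calls=CALLS):
    """Return every function that calls target directly, sorted by name."""
    return sorted(caller for caller, callees in calls.items() if target in callees)


def shortest_call_path(source, target, calls=CALLS):
    """Breadth-first search; returns the path as a list, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for callee in calls.get(node, []):
            if callee not in visited:
                visited.add(callee)
                queue.append(path + [callee])
    return None


print(callers_of("alloc"))                       # ['execute', 'insert_row', 'parse']
print(shortest_call_path("main", "insert_row"))  # ['main', 'execute', 'insert_row']
```

Breadth-first search visits functions in order of call depth, so the first path that reaches the target is also a shortest one.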
Tracing Data Flow
The DataflowOnboardingHandler family (5 handlers) traces data flow, taint propagation, privilege boundaries, reaching definitions, and interprocedural flows:
> How does data flow from pg_parse_query to executor?
╭─────────────── Data Flow ─────────────╮
│ pg_parse_query() │
│ ↓ │
│ pg_analyze_and_rewrite() │
│ ↓ │
│ pg_plan_queries() │
│ ↓ │
│ PortalRun() │
│ ↓ │
│ ExecutorRun() │
│ ↓ │
│ ExecProcNode() │
╰───────────────────────────────────────╯
External Context
The external context handlers query Git history and issue trackers:
> Who wrote heap_insert? → AuthorQueryOnboardingHandler
> What issues are related to executor? → IssueQueryOnboardingHandler
> Show error hotspots in parser → ErrorHotspotOnboardingHandler
> What functions change most often? → ChurnAnalysisOnboardingHandler
Handler Categories
Core Handlers
| Category | Count | Key Handlers |
|---|---|---|
| Definition | 1 | DefinitionOnboardingHandler — signatures, locations, related functions |
| Call Graph | 8 | CallGraph, ShortestPath, CallDepth, CycleDetection, SCC, WCC, Centrality, Intersection |
| Dataflow | 5 | Dataflow, TaintFlow, PrivilegeBoundary, ReachingDefinitions, InterproceduralFlow |
| Transitive Calls | 4 | LimitedCallers, MultiLevelCallers, TransitiveCallees, CommonCallees |
| Query Types | 3 | Debug, BusinessLogic, Subsystem |
| External Context | 4 | Author, Issues, ErrorHotspot, ChurnAnalysis |
| Utility | 5 | General, MechanismExplanation, FunctionSearch, ProjectStatistics, FileStructure |
Cross-Scenario Adapters
19 adapters bridge onboarding to other scenarios without leaving the onboarding context. They include:
- Clone detection: CloneDetection, SemanticClone, StructuralClone, PatternClone
- Documentation: APIDoc, ModuleDoc, SystemDoc, FunctionDoc
- Debugging: DebugPoints, Hooks, MacroDefinition, ErrorPaths, ErrorHandling
- Other: ExtensionPoints, APIMigration, UnitTest, DMLFlow, TraceExecution
Security & Refactoring Handlers
| Category | Count | Handlers |
|---|---|---|
| Security | 5 | MemorySafetyVuln, InputValidationVuln, AuthSecurity, RaceCondition, NetworkEntry |
| Refactoring | 4 | BulkOperations, ExtractionCandidates, FullAudit, DDLFlow |
| Analysis subtypes | 14 | DeadCode, Tracking, Memory, Concurrency, EntryPoint, TrustBoundary, and more |
Configuration
Handler limits are set in config.yaml and exposed through the HandlerLimits dataclass:
workflows:
  handlers:
    limits:
      retrieved_functions: 25   # Final result list size
      cpg_results: 50           # CPG query result limit
      display_items: 15         # Items shown in output
      callers: 15               # Call graph callers limit
      callees: 15               # Call graph callees limit
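A dataclass is a natural fit for these limits because it gives each key a typed default. The sketch below is one way such a class could look, not the project's HandlerLimits: the `from_config` method is hypothetical, the nesting `workflows.handlers.limits` is assumed from the config above, and the input is an already-parsed dictionary so that no YAML library is required.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HandlerLimits:
    # Defaults mirror the values shown in the config above.
    retrieved_functions: int = 25
    cpg_results: int = 50
    display_items: int = 15
    callers: int = 15
    callees: int = 15

    @classmethod
    def from_config(cls, config: dict) -> "HandlerLimits":
        """Read workflows.handlers.limits, ignoring unknown keys."""
        raw = config.get("workflows", {}).get("handlers", {}).get("limits", {})
        known = {f.name for f in fields(cls)}
        return cls(**{k: int(v) for k, v in raw.items() if k in known})


# Only the overridden key changes; everything else keeps its default.
limits = HandlerLimits.from_config(
    {"workflows": {"handlers": {"limits": {"callers": 5}}}}
)
print(limits.callers, limits.cpg_results)  # 5 50
```

Ignoring unknown keys keeps older code working when new limits are added to the config file; a stricter variant could raise an error instead.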
Enrichment config (optional, disabled by default):
workflows:
  onboarding:
    enrichment:
      enable: false                   # Enable enrichment pipeline
      enable_vector: false            # Enable vector context phase
      enable_llm: false               # Enable LLM synthesis phase
      vector_top_k: 3                 # Vector search results per collection
      max_functions_to_describe: 10
All domain-specific behavior (debug categories, DML functions, business logic keywords, subsystem aliases) comes from the active domain plugin via get_active_domain() — nothing is hardcoded.
CLI Usage
# Query-based onboarding
python -m src.cli query "Where is palloc defined?"
# Deep analysis with LLM
python -m src.cli exec --prompt "Explain the executor subsystem architecture"
# Call graph exploration
python -m src.cli query "Show callers of heap_insert"
# Data flow tracing
python -m src.cli query "How does data flow from parser to executor?"
Example Questions
Definitions & Structure:
- “Where is palloc defined?”
- “What is the executor subsystem?”
- “Show file structure of src/backend/executor”
- “How many functions are in the codebase?”
Call Graphs:
- “Show all callers of heap_insert”
- “What functions does ExecProcNode call?”
- “Find shortest path from main to SPI_execute”
- “Detect cycles in the call graph”
- “Show betweenness centrality for executor functions”
Data Flow:
- “How does data flow from pg_parse_query to executor?”
- “Trace taint from user input to SQL execution”
- “Show privilege boundary crossings”
External Context:
- “Who wrote heap_insert?”
- “What issues are related to the parser?”
- “Show error hotspots in utils”
- “What functions change most often?”
Security (via onboarding):
- “Find memory safety issues in executor”
- “Show input validation gaps”
- “Find race conditions in shared memory”
Related Scenarios
- Feature Development (S04) — Once you understand the codebase
- Debugging (S15) — For troubleshooting issues
- Architecture (S11) — For deeper architectural understanding
- Security Audit (S02) — For security-focused exploration