Scenario 01: Codebase Onboarding

A new developer joins the team and needs to understand the codebase structure quickly using CPG-powered analysis.

Quick Start

# Select Onboarding Scenario
/select 01

How It Works

Query Type Detection

When you ask a question, OnboardingQueryDetector classifies it using 4 mixin-based detectors in a 5-phase pipeline:

| Detector Mixin | Methods | Detects |
| --- | --- | --- |
| DefinitionDetectorMixin | 4 | Function location, signature, file path queries |
| CallGraphDetectorMixin | 11 | Callers, callees, paths, cycles, centrality, SCC/WCC |
| DataflowDetectorMixin | 4 | Data flow, taint propagation, privilege boundaries |
| SubsystemDetectorMixin | 14 | Subsystem overview, debug, business logic, external context |

Detection runs in 5 phases:

Query
  |
  v
[Phase 1] Early exit — parenthesized function definitions
  |
  v
[Phase 2] Complex structural checks (17 rules)
  |
  v
[Phase 3] Lifecycle trace & execution trace
  |
  v
[Phase 4] Dispatch table — priority-ordered detection rules
  |
  v
[Phase 5] Fallback chain
  |
  v
DetectionResult(type, target, target_variants, variable,
                module_filter, file_path_filter, directory_filter)

The detector supports both English and Russian queries, using morphological keyword matching via keyword_match_morphological().
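The phased flow above can be sketched as a chain of functions that each either return a result or pass the query on. This is a minimal illustration, not the project's detector: the DetectionResult field names come from the diagram, but their types and defaults, the two toy phases, the regex rules, and the "general" fallback type are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class DetectionResult:
    # Field names follow the diagram above; types and defaults are assumed.
    type: str
    target: Optional[str] = None
    target_variants: List[str] = field(default_factory=list)
    variable: Optional[str] = None
    module_filter: Optional[str] = None
    file_path_filter: Optional[str] = None
    directory_filter: Optional[str] = None


# A phase inspects the query and returns a result, or None to fall through.
Phase = Callable[[str], Optional[DetectionResult]]


def early_exit(query: str) -> Optional[DetectionResult]:
    # Hypothetical early-exit rule: "name()" signals a definition lookup.
    match = re.search(r"\b(\w+)\(\)", query)
    if match:
        return DetectionResult("definition", target=match.group(1))
    return None


# Hypothetical priority-ordered dispatch table: the first matching rule wins.
DISPATCH_RULES = [
    (re.compile(r"callers of (\w+)", re.IGNORECASE), "callers"),
    (re.compile(r"where is (\w+) defined", re.IGNORECASE), "definition"),
]


def dispatch(query: str) -> Optional[DetectionResult]:
    for pattern, query_type in DISPATCH_RULES:
        match = pattern.search(query)
        if match:
            return DetectionResult(query_type, target=match.group(1))
    return None


def detect(query: str, phases: Sequence[Phase] = (early_exit, dispatch)) -> DetectionResult:
    # Run phases in order and stop at the first one that produces a result.
    for phase in phases:
        result = phase(query)
        if result is not None:
            return result
    # Fallback when no phase matched (assumed query type name).
    return DetectionResult("general")
```

Ordering the phases from most to least specific keeps cheap, unambiguous checks ahead of broader keyword rules, which is the property the real pipeline appears to rely on.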

Handler Architecture

The onboarding scenario uses a handler-based architecture with OnboardingHandlerRegistry containing 65 registered handlers across 7 categories:

DetectionResult
    |
    v
OnboardingHandlerRegistry → find handler by query type
    |
    v
OnboardingHandler.handle(query_info)
    |
    v
OnboardingResult
    |-- should_return=True  → return formatted answer (+ optional enrichment)
    |-- should_return=False → gather data, fall through to LLM
    |
    v
Post-processing safety net (catches missed queries)
    |
    v
LLM response generation (if no handler matched)

Each handler inherits from OnboardingHandler and returns an OnboardingResult with 8 fields:

| Field | Type | Description |
| --- | --- | --- |
| cpg_results | list | CPG query results |
| retrieved_functions | list | Function names for context |
| answer | str | Formatted response |
| evidence | list | Evidence strings |
| handled_flag | str | State flag to prevent re-handling |
| should_return | bool | Return immediately or continue to LLM |
| graph_context | dict | Additional graph data |
| skip_enrichment | bool | Opt-out of LLM enrichment |
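To make the handler flow concrete, here is a small sketch of a result type, a registry lookup, and the should_return branch. The eight field names and their types are taken from the table above; the defaults, the register() and find() methods, the query_type attribute, the DefinitionHandler class, and the answer_query() helper are hypothetical and only stand in for whatever the real classes do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class OnboardingResult:
    # Field names and types follow the table above; defaults are assumed.
    cpg_results: list = field(default_factory=list)
    retrieved_functions: list = field(default_factory=list)
    answer: str = ""
    evidence: list = field(default_factory=list)
    handled_flag: str = ""
    should_return: bool = False
    graph_context: dict = field(default_factory=dict)
    skip_enrichment: bool = False


class OnboardingHandler:
    query_type = ""  # hypothetical attribute used as the registry key

    def handle(self, query_info: dict) -> OnboardingResult:
        raise NotImplementedError


class DefinitionHandler(OnboardingHandler):
    """Hypothetical handler that answers directly without the LLM."""

    query_type = "definition"

    def handle(self, query_info: dict) -> OnboardingResult:
        target = query_info["target"]
        return OnboardingResult(
            retrieved_functions=[target],
            answer=f"{target}: definition lookup result goes here",
            handled_flag="definition_handled",
            should_return=True,
        )


class OnboardingHandlerRegistry:
    def __init__(self) -> None:
        self._handlers: Dict[str, OnboardingHandler] = {}

    def register(self, handler: OnboardingHandler) -> None:
        self._handlers[handler.query_type] = handler

    def find(self, query_type: str) -> Optional[OnboardingHandler]:
        return self._handlers.get(query_type)


def answer_query(
    registry: OnboardingHandlerRegistry,
    query_info: dict,
    llm: Callable[[dict, Optional[OnboardingResult]], str],
) -> str:
    handler = registry.find(query_info["type"])
    if handler is None:
        return llm(query_info, None)      # no handler matched
    result = handler.handle(query_info)
    if result.should_return:
        return result.answer              # formatted answer, LLM skipped
    return llm(query_info, result)        # gathered data feeds the LLM
```

The sketch omits the post-processing safety net and enrichment; it only shows how should_return separates direct answers from data gathering.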

Retrieval Strategy

The base OnboardingHandler provides 6 specialized search methods for multi-source retrieval:

| Method | Source | Purpose |
| --- | --- | --- |
| _vector_search_fallback() | ChromaDB code_comments | Semantic search in docstrings when CPG returns nothing |
| _snippet_search() | ChromaDB code_snippets | Search in function implementations |
| _domain_pattern_search() | ChromaDB domain_patterns | Find security/taint patterns |
| _vector_supplement() | All sources | 4-phase additive merge (see below) |
| _get_pre_retrieval_names() | state["pre_retrieval_results"] | Integrate HybridRetriever results |
| enrich_retrieved_functions() | CPG | Add entity types (function/class/interface) |

The _vector_supplement() method performs a 4-phase additive merge to maximize recall:

Phase 1: CPG exact matches (highest priority)
Phase 2: Pre-retrieval hybrid results (from HybridRetriever)
Phase 3: Vector search additions (semantic similarity)
Phase 4: Code snippet additions (implementation search)
    → Deduplicated merged list
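The additive merge can be pictured as an order-preserving union: earlier phases keep their position, and later phases only contribute names not yet seen. The sketch below illustrates that idea under assumptions. The function name additive_merge, the optional limit parameter, and the sample function names are invented for the example and are not the signature of _vector_supplement().

```python
from typing import Iterable, List, Optional


def additive_merge(*sources: Iterable[str], limit: Optional[int] = None) -> List[str]:
    """Merge sources in priority order, keeping the first occurrence of each name."""
    seen = set()
    merged: List[str] = []
    for source in sources:            # sources are passed highest priority first
        for name in source:
            if name not in seen:      # later phases only add, never reorder
                seen.add(name)
                merged.append(name)
    return merged if limit is None else merged[:limit]


# Illustrative inputs, one list per phase (names are made up for the example).
cpg_matches = ["heap_insert", "heap_update"]
pre_retrieval = ["heap_update", "heap_delete"]
vector_hits = ["heap_insert", "simple_heap_insert"]
snippet_hits = ["heap_delete", "heap_multi_insert"]

merged = additive_merge(cpg_matches, pre_retrieval, vector_hits, snippet_hits)
print(merged)
```

Because nothing is ever removed or reordered, CPG exact matches always stay at the front of the list, and a final size cap (such as the retrieved_functions limit) trims only the lower-priority tail.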

Enrichment Pipeline

When a handler returns should_return=True and enrichment is enabled, a 3-phase enrichment pipeline runs:

OnboardingResult (from handler)
    |
    v
[Phase 1] CPG Comments
    |-- Batch-fetch function descriptions from code comments
    |-- Merge descriptions into cpg_results
    |-- Reformat answer with descriptions
    |
    v
[Phase 2] Vector Context (parallel, 4 workers)
    |-- Q&A pairs (retrieve_qa)
    |-- SQL examples (retrieve_sql)
    |-- Generated docs (retrieve_generated_docs)
    |-- Code comments (retrieve_comments)
    |-- Append to evidence
    |
    v
[Phase 3] LLM Synthesis (optional)
    |-- Build prompt with CPG + vector context
    |-- Generate enriched answer (cached by query hash)
    |-- Preserve original as _original_answer

Handlers can set skip_enrichment=True to opt out (e.g., for structured tabular responses).
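A rough sketch of the three enrichment phases follows, using a plain dict in place of OnboardingResult. Everything here is an assumption made for illustration: the enrich() function, its describe, retrievers, and synthesize parameters, and the evidence string format are hypothetical. Answer reformatting and the query-hash cache are left out to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def enrich(
    result: dict,
    query: str,
    describe: Callable[[List[str]], Dict[str, str]],
    retrievers: Dict[str, Callable[[str], List[str]]],
    synthesize: Optional[Callable[[str, dict], str]] = None,
    max_workers: int = 4,
) -> dict:
    # Handlers that opted out are returned untouched.
    if result.get("skip_enrichment"):
        return result
    enriched = dict(result)  # shallow copy so the handler's result is not mutated

    # Phase 1: batch-fetch descriptions and merge them into the CPG rows.
    descriptions = describe(enriched["retrieved_functions"])
    enriched["cpg_results"] = [
        {**row, "description": descriptions.get(row["name"], "")}
        for row in enriched["cpg_results"]
    ]

    # Phase 2: query every vector source in parallel, then append to evidence.
    evidence = list(enriched.get("evidence", []))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in retrievers.items()}
        for name, future in futures.items():
            evidence.extend(f"[{name}] {hit}" for hit in future.result())
    enriched["evidence"] = evidence

    # Phase 3 (optional): replace the answer but keep the original alongside it.
    if synthesize is not None:
        enriched["_original_answer"] = enriched["answer"]
        enriched["answer"] = synthesize(query, enriched)
    return enriched
```

Collecting futures in insertion order keeps the evidence list deterministic even though the retrievals themselves run concurrently.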

Use Cases

Day 1: Codebase Overview

Ask high-level questions about the system:

> What is the executor subsystem?

╭─────────────── Answer ────────────────╮
│ The executor subsystem is responsible │
│ for executing query plans generated   │
│ by the planner.                       │
│                                       │
│ Key components:                       │
│   - ExecutorStart: Initialize state   │
│   - ExecutorRun: Main execution loop  │
│   - ExecutorEnd: Cleanup resources    │
│                                       │
│ Entry point: src/backend/executor/    │
│              execMain.c               │
╰───────────────────────────────────────╯

> What are the main entry points in the executor?

> Show me the architecture of query execution

Finding Definitions

The DefinitionOnboardingHandler locates function implementations with signatures and related functions, formatted by DefinitionFormatter:

> Where is palloc defined?

╭─────────────── Answer ────────────────╮
│ palloc is defined in:                 │
│   src/backend/utils/mmgr/mcxt.c:1089  │
│                                       │
│ Signature:                            │
│   void *palloc(Size size)             │
│                                       │
│ Related functions:                    │
│   palloc0(), palloc_extended(),       │
│   repalloc(), pfree()                 │
╰───────────────────────────────────────╯

Understanding Call Graphs

The CallGraphOnboardingHandler family (8 handlers) provides callers, callees, shortest paths, cycle detection, SCC/WCC analysis, and centrality metrics, formatted by CallGraphFormatter:

> Show me all callers of palloc

╭─────────────── Callers ───────────────╮
│ 1. heap_form_tuple()                  │
│ 2. ExecStoreTuple()                   │
│ 3. construct_array()                  │
│ 4. pnstrdup()                         │
│ 5. SPI_connect()                      │
│ ... (showing top 5 of 2,847 callers)  │
╰───────────────────────────────────────╯

> What functions does LWLockAcquire call?

> Find shortest call path from main to heap_insert

> Detect cycles in the call graph involving executor
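For intuition about the shortest-path query, a breadth-first search over a caller-to-callees mapping is one standard way to find the shortest call chain. The sketch below assumes that technique and a plain dict as the graph; it is not taken from the ShortestPath handler, and the toy graph uses made-up function names.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_call_path(
    call_graph: Dict[str, List[str]], source: str, target: str
) -> Optional[List[str]]:
    """Return the shortest caller-to-callee chain, or None if target is unreachable."""
    if source == target:
        return [source]
    parents: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, []):
            if callee in parents:         # already reached by a path at least as short
                continue
            parents[callee] = current
            if callee == target:
                # Walk the parent links back to the source, then reverse.
                path = [callee]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(callee)
    return None


# Toy call graph with hypothetical function names.
toy_graph = {
    "main": ["parse", "run"],
    "parse": ["run"],
    "run": ["insert"],
    "insert": ["main"],   # back edge: cycles do not break the search
}
print(shortest_call_path(toy_graph, "main", "insert"))
```

The visited check on parents is what keeps the search finite on cyclic call graphs, which matters because recursion and mutual recursion are common in real code.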

Tracing Data Flow

The DataflowOnboardingHandler family (5 handlers) traces data flow, taint propagation, privilege boundaries, reaching definitions, and interprocedural flows:

> How does data flow from pg_parse_query to executor?

╭─────────────── Data Flow ─────────────╮
│ pg_parse_query()                      │
│     ↓                                 │
│ pg_analyze_and_rewrite()              │
│     ↓                                 │
│ pg_plan_queries()                     │
│     ↓                                 │
│ PortalRun()                           │
│     ↓                                 │
│ ExecutorRun()                         │
│     ↓                                 │
│ ExecProcNode()                        │
╰───────────────────────────────────────╯

External Context

The external context handlers query Git history and issue trackers:

> Who wrote heap_insert?              → AuthorQueryOnboardingHandler
> What issues are related to executor? → IssueQueryOnboardingHandler
> Show error hotspots in parser        → ErrorHotspotOnboardingHandler
> What functions change most often?    → ChurnAnalysisOnboardingHandler

Handler Categories

Core Handlers

| Category | Count | Key Handlers |
| --- | --- | --- |
| Definition | 1 | DefinitionOnboardingHandler (signatures, locations, related functions) |
| Call Graph | 8 | CallGraph, ShortestPath, CallDepth, CycleDetection, SCC, WCC, Centrality, Intersection |
| Dataflow | 5 | Dataflow, TaintFlow, PrivilegeBoundary, ReachingDefinitions, InterproceduralFlow |
| Transitive Calls | 4 | LimitedCallers, MultiLevelCallers, TransitiveCallees, CommonCallees |
| Query Types | 3 | Debug, BusinessLogic, Subsystem |
| External Context | 4 | Author, Issues, ErrorHotspot, ChurnAnalysis |
| Utility | 5 | General, MechanismExplanation, FunctionSearch, ProjectStatistics, FileStructure |

Cross-Scenario Adapters

19 adapters bridge onboarding to other scenarios without leaving the onboarding context, including:

  • Clone detection: CloneDetection, SemanticClone, StructuralClone, PatternClone
  • Documentation: APIDoc, ModuleDoc, SystemDoc, FunctionDoc
  • Debugging: DebugPoints, Hooks, MacroDefinition, ErrorPaths, ErrorHandling
  • Other: ExtensionPoints, APIMigration, UnitTest, DMLFlow, TraceExecution

Security & Refactoring Handlers

| Category | Count | Handlers |
| --- | --- | --- |
| Security | 5 | MemorySafetyVuln, InputValidationVuln, AuthSecurity, RaceCondition, NetworkEntry |
| Refactoring | 4 | BulkOperations, ExtractionCandidates, FullAudit, DDLFlow |
| Analysis subtypes | 14 | DeadCode, Tracking, Memory, Concurrency, EntryPoint, TrustBoundary, and more |

Configuration

Handler limits are set in config.yaml and exposed through the HandlerLimits dataclass:

workflows:
  handlers:
    limits:
      retrieved_functions: 25    # Final result list size
      cpg_results: 50            # CPG query result limit
      display_items: 15          # Items shown in output
      callers: 15                # Call graph callers limit
      callees: 15                # Call graph callees limit
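One plausible way to map this section onto a dataclass is sketched below. The HandlerLimits name and the five keys come from the text above; the from_config() classmethod, the use of the YAML values as defaults, and the choice to ignore unknown keys are assumptions for illustration. The input is an already-parsed dict, so no YAML library is needed here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HandlerLimits:
    # Defaults mirror the values shown in the YAML above (assumed, not confirmed).
    retrieved_functions: int = 25
    cpg_results: int = 50
    display_items: int = 15
    callers: int = 15
    callees: int = 15

    @classmethod
    def from_config(cls, config: dict) -> "HandlerLimits":
        # Hypothetical loader: walks workflows -> handlers -> limits and
        # silently ignores keys the dataclass does not know about.
        raw = config.get("workflows", {}).get("handlers", {}).get("limits", {})
        known = {f.name for f in fields(cls)}
        return cls(**{key: int(value) for key, value in raw.items() if key in known})


parsed = {"workflows": {"handlers": {"limits": {"callers": 5, "display_items": "10"}}}}
limits = HandlerLimits.from_config(parsed)
print(limits)
```

Missing keys fall back to the defaults, so a partial limits section only overrides the values it names.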

Enrichment config (optional, disabled by default):

workflows:
  onboarding:
    enrichment:
      enable: false              # Enable enrichment pipeline
      enable_vector: false       # Enable vector context phase
      enable_llm: false          # Enable LLM synthesis phase
      vector_top_k: 3            # Vector search results per collection
      max_functions_to_describe: 10

All domain-specific behavior (debug categories, DML functions, business logic keywords, subsystem aliases) comes from the active domain plugin via get_active_domain() — nothing is hardcoded.

CLI Usage

# Query-based onboarding
python -m src.cli query "Where is palloc defined?"

# Deep analysis with LLM
python -m src.cli exec --prompt "Explain the executor subsystem architecture"

# Call graph exploration
python -m src.cli query "Show callers of heap_insert"

# Data flow tracing
python -m src.cli query "How does data flow from parser to executor?"

Example Questions

Definitions & Structure:

  • “Where is palloc defined?”
  • “What is the executor subsystem?”
  • “Show file structure of src/backend/executor”
  • “How many functions are in the codebase?”

Call Graphs:

  • “Show all callers of heap_insert”
  • “What functions does ExecProcNode call?”
  • “Find shortest path from main to SPI_execute”
  • “Detect cycles in the call graph”
  • “Show betweenness centrality for executor functions”

Data Flow:

  • “How does data flow from pg_parse_query to executor?”
  • “Trace taint from user input to SQL execution”
  • “Show privilege boundary crossings”

External Context:

  • “Who wrote heap_insert?”
  • “What issues are related to the parser?”
  • “Show error hotspots in utils”
  • “What functions change most often?”

Security (via onboarding):

  • “Find memory safety issues in executor”
  • “Show input validation gaps”
  • “Find race conditions in shared memory”