User Guide

Complete guide to using CodeGraph for code analysis.

Overview

CodeGraph answers natural language questions about codebases by combining:

- Semantic search - find code by meaning and intent
- Structural search - traverse call graphs and data flow
- LLM synthesis - generate human-readable answers

Basic Usage

Interactive Mode

python examples/demo_simple.py

Enter questions at the prompt:

> What does CommitTransaction do?
> Find methods that handle memory allocation
> Show the call chain from executor to storage

Programmatic Usage

from src.workflow import MultiScenarioCopilot

copilot = MultiScenarioCopilot()
question = "What methods handle transaction commits?"
result = copilot.run(question)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result.get('confidence', 'N/A')}")
print(f"Intent: {result.get('intent')}")

The run() method accepts optional parameters:

result = copilot.run(
    query="Find SQL injection vulnerabilities",
    context={"scenario_id": "scenario_2"},  # Force security scenario
    language="en",                           # Response language (en/ru)
)

Question Types

Definition Queries

Find where code is defined:

Find method 'heap_insert'
Where is AbortTransaction defined?
Show me the RelationGetBufferForTuple function

Relationship Queries

Understand code relationships:

What methods call LWLockAcquire?
Find callers of MemoryContextCreate
What does heap_insert call?

Semantic Queries

Ask about behavior and purpose:

How does PostgreSQL handle MVCC?
Explain the transaction commit process
What mechanism ensures durability?

Security Queries

Find vulnerabilities:

Find potential SQL injection points
Show unsanitized user input paths
Find buffer overflow risks
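
As an illustration of how keyword-based routing can distinguish these question types, here is a toy sketch. This is not the project's actual classifier (see classification_method below); the keyword lists and function name are assumptions made for the example:

```python
# Illustrative sketch of keyword-based intent routing, NOT the
# project's actual classifier. Keyword lists are assumptions.
INTENT_KEYWORDS = {
    "definition": ("where is", "find method", "show me the"),
    "relationship": ("call", "callers of"),
    "security": ("injection", "unsanitized", "overflow", "vulnerab"),
}

def guess_intent(question: str) -> str:
    q = question.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return intent
    return "semantic"  # fall back to semantic search

print(guess_intent("Find buffer overflow risks"))  # security
```

A real classifier also handles ambiguous queries, which is why the result carries a confidence score (see Understanding Results).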

Understanding Results

Result Structure

copilot.run() returns a MultiScenarioState dictionary with these key fields:

{
    # Core output
    "answer": "CommitTransaction finalizes a transaction by...",
    "confidence": 0.85,                # Intent classification confidence (0.0–1.0)
    "intent": "security",              # Detected intent (e.g., security, performance)
    "scenario_id": "scenario_2",       # Executed scenario ID

    # Supporting data
    "evidence": ["xact.c:1234 — CommitTransaction calls..."],
    "cpg_results": [...],              # Raw CPG query results
    "metadata": {...},                 # Scenario-specific metadata

    # Classification
    "classification_method": "keyword", # "keyword" or "llm"

    # Error handling
    "error": None,                     # Error message if any
}

The full MultiScenarioState TypedDict (src/workflow/state.py) contains 21 keys including query, context, language, subsystems, methods, call_graph, retrieved_functions, retry_count, enrichment_config, vector_store, db_path, collection_prefix, and pre_retrieval_results.
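
A defensive way to consume this structure is to tolerate missing optional keys. The sample dictionary below mirrors the fields shown above with illustrative values; the summarize helper is ours, not part of the API:

```python
# Sample result mirroring the structure above; values are illustrative.
result = {
    "answer": "CommitTransaction finalizes a transaction by...",
    "confidence": 0.85,
    "intent": "security",
    "scenario_id": "scenario_2",
    "evidence": ["xact.c:1234 — CommitTransaction calls..."],
    "error": None,
}

def summarize(result: dict) -> str:
    """Format the key fields, tolerating missing optional keys."""
    if result.get("error"):
        return f"FAILED: {result['error']}"
    lines = [
        f"Intent: {result.get('intent', 'unknown')} "
        f"(confidence {result.get('confidence', 0.0):.2f})",
        result.get("answer", "(no answer)"),
    ]
    for ref in result.get("evidence", []):
        lines.append(f"  evidence: {ref}")
    return "\n".join(lines)

print(summarize(result))
```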

Confidence Levels

The confidence field reflects the intent classifier’s certainty, not answer quality:

Level     Meaning
> 0.9     High confidence - keyword match or forced scenario
0.7-0.9   Good confidence - LLM classification
0.5-0.7   Moderate - ambiguous intent
< 0.5     Low - fallback to default scenario
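
The thresholds above can be turned into a small helper. This is a sketch: the levels come from the table, but the function name and exact boundary handling are our assumptions:

```python
def confidence_level(confidence: float) -> str:
    """Map a confidence score to the levels described above (a sketch)."""
    if confidence > 0.9:
        return "high"      # keyword match or forced scenario
    if confidence >= 0.7:
        return "good"      # LLM classification
    if confidence >= 0.5:
        return "moderate"  # ambiguous intent
    return "low"           # fallback to default scenario

print(confidence_level(0.85))  # good
```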

Advanced Features

Hybrid Search Mode

Combine semantic (ChromaDB) and structural (DuckDB CPG) search:

from src.agents.retriever_agent import RetrieverAgent
from src.retrieval.vector_store import VectorStoreReal
from src.agents.analyzer_agent import AnalyzerAgent
from src.services.cpg import CPGQueryService

# Initialize dependencies
vector_store = VectorStoreReal()
analyzer_agent = AnalyzerAgent()
cpg_service = CPGQueryService()

# Create retriever with hybrid mode
retriever = RetrieverAgent(
    vector_store=vector_store,
    analyzer_agent=analyzer_agent,
    cpg_service=cpg_service,   # Enables hybrid retrieval
    enable_hybrid=True,
)

# Run hybrid retrieval
results = retriever.retrieve_hybrid(
    question="Find memory allocation patterns",
    mode="hybrid",             # "hybrid", "vector_only", or "graph_only"
    query_type="structural",   # Hint: "semantic", "structural", "security"
    top_k=10,
)

Multi-Domain Analysis

Switch between codebases using ProjectManager:

from src.project_manager import ProjectManager

pm = ProjectManager()

# Switch to a registered project (activates DB, collections, domain)
pm.switch_project("postgresql")

# Analyze in the context of this project
from src.workflow import MultiScenarioCopilot
copilot = MultiScenarioCopilot()
result = copilot.run("Find buffer overflow risks")

# Switch to another project
pm.switch_project("linux_kernel")
result = copilot.run("Find use-after-free patterns")

Projects are registered in the projects section of config.yaml, with db_path, source_path, language, and domain fields.
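
A sketch of what such an entry might look like. The field names come from the list above; the project name, paths, and values are illustrative assumptions, not actual defaults:

```yaml
# Illustrative sketch of a projects entry in config.yaml.
# Paths and values are assumptions; only the field names are documented.
projects:
  postgresql:
    db_path: data/projects/postgres.duckdb
    source_path: /path/to/postgres/src
    language: c
    domain: database
```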

Scenario-Based Analysis

Use the copilot for scenario-based analysis (intent is detected automatically):

from src.workflow import MultiScenarioCopilot

copilot = MultiScenarioCopilot()

# Security analysis - intent detected automatically
result = copilot.run("Find SQL injection vulnerabilities")
print(f"Intent: {result.get('intent')}")  # → 'security'

# Performance analysis
result = copilot.run("Find functions with high cyclomatic complexity")
print(f"Intent: {result.get('intent')}")  # → 'performance'

# Force a specific scenario (e.g., security = scenario_2)
result = copilot.run(
    "Analyze authentication module",
    context={"scenario_id": "scenario_2"},
)

21 scenarios are available (S01 onboarding through S21 interface_docs_sync). See Scenarios for the full list.

Pattern Search

Search for code patterns using GoCPG’s tree-sitter CST matching with CPG-aware constraints:

Using the Patterns CLI Programmatically

import asyncio
from src.services.gocpg import GoCPGClient

async def main():
    client = GoCPGClient()

    # Ad-hoc pattern search (no CPG DB needed)
    results = await client.search(pattern="malloc($x)", language="c", max_results=50)

    # CPG-aware scan with rules
    results = await client.scan(
        db_path="data/projects/postgres.duckdb",
        rule_id="unchecked-return",
    )
    print(results)

asyncio.run(main())

Pattern Findings via CPG Query Service

from src.services.cpg import CPGQueryService

cpg = CPGQueryService()

# Query persisted pattern findings
findings = cpg.get_pattern_findings(severity="high")
stats = cpg.get_pattern_statistics()

Best Practices

Writing Effective Questions

Good questions:

- “What functions handle memory allocation in the buffer manager?”
- “Show the call path from parser to executor”
- “Find unsanitized inputs that reach database queries”

Less effective:

- “Tell me about the code” (too vague)
- “Fix this bug” (action request, not analysis)
- “Everything about transactions” (too broad)

Optimizing Performance

  1. Be specific - Narrow questions get faster answers
  2. Use structural queries - When you know the pattern
  3. Enable caching - For repeated similar queries
  4. Limit scope - Add file or subsystem constraints

Interpreting Answers

  1. Check evidence - Verify the code references in result['evidence']
  2. Consider confidence - Lower confidence = verify manually
  3. Follow up - Ask clarifying questions
  4. Cross-reference - Compare with actual code
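
The checks above can be combined into a simple acceptance gate. This is a sketch: the 0.7 threshold and function name are assumptions, not project defaults:

```python
def needs_manual_review(result: dict, min_confidence: float = 0.7) -> bool:
    """Flag answers that should be cross-checked against the code.

    An answer warrants manual review when the run errored, classification
    confidence is low, or no supporting evidence was returned. The 0.7
    threshold is an assumption, not a project default.
    """
    if result.get("error"):
        return True
    if result.get("confidence", 0.0) < min_confidence:
        return True
    return not result.get("evidence")

print(needs_manual_review({"confidence": 0.9, "evidence": ["xact.c:1234"]}))  # False
```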

Workflow Integration

CI/CD Integration

# .github/workflows/code-analysis.yml
- name: Run Code Analysis
  run: |
    python -c "
    from src.workflow import MultiScenarioCopilot
    copilot = MultiScenarioCopilot()
    result = copilot.run('Find potential security issues')
    if result.get('error'):
        print(f'Analysis error: {result[\"error\"]}')
        raise SystemExit(1)
    print(result['answer'])
    "

Code Review

# Run automated patch review demo
python examples/demo_patch_review.py --db data/projects/myproject.duckdb

# Available flags:
#   --db PATH       Path to DuckDB CPG database
#   --no-dod        Disable Definition of Done functionality
#   --auto-dod      Auto-generate DoD instead of extracting from PR body
#   --interactive   Enable interactive DoD confirmation prompts

Documentation Generation

from src.workflow import MultiScenarioCopilot

copilot = MultiScenarioCopilot()
result = copilot.run("Document the transaction subsystem")

# result['answer'] contains generated documentation
print(result['answer'])

Next Steps