Complete guide to using CodeGraph for code analysis.
Table of Contents
- Overview
- Basic Usage
  - Interactive Mode
  - Programmatic Usage
- Question Types
  - Definition Queries
  - Relationship Queries
  - Semantic Queries
  - Security Queries
- Understanding Results
  - Result Structure
  - Confidence Levels
- Advanced Features
  - Hybrid Search Mode
  - Multi-Domain Analysis
  - Scenario-Based Analysis
  - Structural Pattern Search
- Best Practices
  - Writing Effective Questions
  - Optimizing Performance
  - Interpreting Answers
- Workflow Integration
  - CI/CD Integration
  - Code Review
  - Documentation Generation
- Next Steps
Overview
CodeGraph answers natural language questions about codebases by combining:
- Semantic search - Find code by meaning and intent
- Structural search - Traverse call graphs and data flow
- LLM synthesis - Generate human-readable answers
Basic Usage
Interactive Mode
python examples/demo_simple.py
Enter questions at the prompt:
> What does CommitTransaction do?
> Find methods that handle memory allocation
> Show the call chain from executor to storage
Programmatic Usage
from src.workflow import MultiScenarioCopilot
copilot = MultiScenarioCopilot()
question = "What methods handle transaction commits?"
result = copilot.run(question)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result.get('confidence', 'N/A')}")
print(f"Intent: {result.get('intent')}")
The run() method accepts optional parameters:
result = copilot.run(
    query="Find SQL injection vulnerabilities",
    context={"scenario_id": "scenario_2"},  # Force security scenario
    language="en",  # Response language (en/ru)
)
Question Types
Definition Queries
Find where code is defined:
Find method 'heap_insert'
Where is AbortTransaction defined?
Show me the RelationGetBufferForTuple function
Relationship Queries
Understand code relationships:
What methods call LWLockAcquire?
Find callers of MemoryContextCreate
What does heap_insert call?
Semantic Queries
Ask about behavior and purpose:
How does PostgreSQL handle MVCC?
Explain the transaction commit process
What mechanism ensures durability?
Security Queries
Find vulnerabilities:
Find potential SQL injection points
Show unsanitized user input paths
Find buffer overflow risks
Understanding Results
Result Structure
copilot.run() returns a MultiScenarioState dictionary with these key fields:
{
    # Core output
    "answer": "CommitTransaction finalizes a transaction by...",
    "confidence": 0.85,  # Intent classification confidence (0.0–1.0)
    "intent": "security",  # Detected intent (e.g., security, performance)
    "scenario_id": "scenario_2",  # Executed scenario ID
    # Supporting data
    "evidence": ["xact.c:1234 — CommitTransaction calls..."],
    "cpg_results": [...],  # Raw CPG query results
    "metadata": {...},  # Scenario-specific metadata
    # Classification
    "classification_method": "keyword",  # "keyword" or "llm"
    # Error handling
    "error": None,  # Error message if any
}
The full MultiScenarioState TypedDict (src/workflow/state.py) contains 21 keys including query, context, language, subsystems, methods, call_graph, retrieved_functions, retry_count, enrichment_config, vector_store, db_path, collection_prefix, and pre_retrieval_results.
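A small helper can make a result easier to log or display. The sketch below works on a plain dictionary shaped like the structure above and reads only the documented keys. `summarize_result` is a hypothetical name, not part of CodeGraph, and the sample values are illustrative.

```python
from typing import Any, Mapping


def summarize_result(result: Mapping[str, Any]) -> str:
    """Render a one-line summary of a copilot.run()-style result dict.

    Hypothetical helper: only keys documented in this guide are read.
    """
    # Surface errors first; the documented "error" field is None on success.
    if result.get("error"):
        return f"ERROR: {result['error']}"

    intent = result.get("intent", "unknown")
    scenario = result.get("scenario_id", "unknown")
    confidence = result.get("confidence")
    evidence_count = len(result.get("evidence") or [])

    conf_text = f"{confidence:.2f}" if confidence is not None else "N/A"
    return (
        f"intent={intent} scenario={scenario} "
        f"confidence={conf_text} evidence={evidence_count}"
    )


# Illustrative data only, mirroring the documented fields.
sample = {
    "answer": "CommitTransaction finalizes a transaction by...",
    "confidence": 0.85,
    "intent": "security",
    "scenario_id": "scenario_2",
    "evidence": ["xact.c:1234"],
    "error": None,
}
print(summarize_result(sample))
```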
Confidence Levels
The confidence field reflects the intent classifier’s certainty, not answer quality:
| Level | Meaning |
|---|---|
| > 0.9 | High confidence - keyword match or forced scenario |
| 0.7-0.9 | Good confidence - LLM classification |
| 0.5-0.7 | Moderate - ambiguous intent |
| < 0.5 | Low - fallback to default scenario |
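The bands in the table can be applied in code when deciding whether a result needs a manual check. This is a minimal sketch: `confidence_level` is a hypothetical helper, and because the table leaves the value 0.7 ambiguous, the sketch assumes it belongs to the higher band.

```python
def confidence_level(confidence: float) -> str:
    """Map an intent-classification confidence to the bands in the table.

    Assumptions: 0.9 is "good" (the table reserves "high" for values above
    0.9), 0.5 is "moderate" (the table reserves "low" for values below 0.5),
    and the ambiguous value 0.7 is assigned to the higher band.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0], got {confidence}")
    if confidence > 0.9:
        return "high"
    if confidence >= 0.7:
        return "good"
    if confidence >= 0.5:
        return "moderate"
    return "low"


for value in (0.95, 0.85, 0.6, 0.3):
    print(value, confidence_level(value))
```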
Advanced Features
Hybrid Search Mode
Combine semantic (ChromaDB) and structural (DuckDB CPG) search:
from src.agents.retriever_agent import RetrieverAgent
from src.retrieval.vector_store import VectorStoreReal
from src.agents.analyzer_agent import AnalyzerAgent
from src.services.cpg import CPGQueryService
# Initialize dependencies
vector_store = VectorStoreReal()
analyzer_agent = AnalyzerAgent()
cpg_service = CPGQueryService()
# Create retriever with hybrid mode
retriever = RetrieverAgent(
    vector_store=vector_store,
    analyzer_agent=analyzer_agent,
    cpg_service=cpg_service,  # Enables hybrid retrieval
    enable_hybrid=True,
)
# Run hybrid retrieval
results = retriever.retrieve_hybrid(
    question="Find memory allocation patterns",
    mode="hybrid",  # "hybrid", "vector_only", or "graph_only"
    query_type="structural",  # Hint: "semantic", "structural", "security"
    top_k=10,
)
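This guide does not describe how the semantic and structural result lists are merged. As a point of reference only, the sketch below shows reciprocal rank fusion, one common way to combine two ranked lists; it is not a claim about what `RetrieverAgent` does internally, and the function names in the sample lists are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of identifiers into one ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1, so items found by both searches rise.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; break ties by name for a deterministic result.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Illustrative identifiers, not real retrieval output.
vector_hits = ["palloc", "MemoryContextAlloc", "repalloc"]
graph_hits = ["MemoryContextAlloc", "AllocSetAlloc", "palloc"]
print(reciprocal_rank_fusion([vector_hits, graph_hits]))
```

Rank-based fusion avoids comparing raw similarity scores from a vector store with graph query results, which are not on a common scale.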
Multi-Domain Analysis
Switch between codebases using ProjectManager:
from src.project_manager import ProjectManager
pm = ProjectManager()
# Switch to a registered project (activates DB, collections, domain)
pm.switch_project("postgresql")
# Analyze in the context of this project
from src.workflow import MultiScenarioCopilot
copilot = MultiScenarioCopilot()
result = copilot.run("Find buffer overflow risks")
# Switch to another project
pm.switch_project("linux_kernel")
result = copilot.run("Find use-after-free patterns")
Projects are registered in config.yaml → projects with db_path, source_path, language, and domain fields.
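Before switching projects it can help to confirm that an entry carries all four fields. The sketch below checks a plain dictionary such as one loaded from config.yaml; `missing_project_fields` is a hypothetical helper and the entry values, including the source path, are illustrative.

```python
from typing import Any, List, Mapping

# Field names taken from the project registration description in this guide.
REQUIRED_PROJECT_FIELDS = ("db_path", "source_path", "language", "domain")


def missing_project_fields(entry: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a project entry."""
    return [name for name in REQUIRED_PROJECT_FIELDS if not entry.get(name)]


# Hypothetical entry as it might look after loading config.yaml.
postgresql = {
    "db_path": "data/projects/postgres.duckdb",
    "source_path": "/path/to/postgres",
    "language": "c",
    "domain": "postgresql",
}
print(missing_project_fields(postgresql))
print(missing_project_fields({"db_path": "x.duckdb"}))
```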
Scenario-Based Analysis
Use the copilot for scenario-based analysis (intent is detected automatically):
from src.workflow import MultiScenarioCopilot
copilot = MultiScenarioCopilot()
# Security analysis - intent detected automatically
result = copilot.run("Find SQL injection vulnerabilities")
print(f"Intent: {result.get('intent')}") # → 'security'
# Performance analysis
result = copilot.run("Find functions with high cyclomatic complexity")
print(f"Intent: {result.get('intent')}") # → 'performance'
# Force a specific scenario (e.g., security = scenario_2)
result = copilot.run(
    "Analyze authentication module",
    context={"scenario_id": "scenario_2"},
)
21 scenarios are available (S01 onboarding through S21 interface_docs_sync). See Scenarios for the full list.
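When a scenario is forced from user input, validating the arguments first catches typos early. The sketch below builds keyword arguments for `run()` using the `query`, `context`, and `language` parameters shown earlier. `build_run_kwargs` is a hypothetical helper, and it assumes every scenario ID follows the `scenario_<n>` form seen in `scenario_2`.

```python
from typing import Any, Dict, Optional

SCENARIO_COUNT = 21  # this guide lists scenarios S01 through S21
SUPPORTED_LANGUAGES = {"en", "ru"}  # response languages named in this guide


def build_run_kwargs(
    query: str,
    scenario_number: Optional[int] = None,
    language: str = "en",
) -> Dict[str, Any]:
    """Build validated keyword arguments for a copilot.run() call."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    kwargs: Dict[str, Any] = {"query": query, "language": language}
    if scenario_number is not None:
        if not 1 <= scenario_number <= SCENARIO_COUNT:
            raise ValueError(f"scenario must be 1..{SCENARIO_COUNT}, got {scenario_number}")
        # Assumption: IDs follow the "scenario_<n>" form seen in "scenario_2".
        kwargs["context"] = {"scenario_id": f"scenario_{scenario_number}"}
    return kwargs


print(build_run_kwargs("Analyze authentication module", scenario_number=2))
```

The returned dictionary can then be passed as `copilot.run(**kwargs)`.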
Structural Pattern Search
Search for code patterns using GoCPG’s tree-sitter CST matching with CPG-aware constraints.
Using the Patterns CLI Programmatically
import asyncio
from src.services.gocpg import GoCPGClient

async def main():
    client = GoCPGClient()
    # Ad-hoc pattern search (no CPG DB needed)
    search_results = await client.search(pattern="malloc($x)", language="c", max_results=50)
    print(search_results)
    # CPG-aware scan with rules
    scan_results = await client.scan(
        db_path="data/projects/postgres.duckdb",
        rule_id="unchecked-return",
    )
    print(scan_results)

asyncio.run(main())
Pattern Findings via CPG Query Service
from src.services.cpg import CPGQueryService
cpg = CPGQueryService()
# Query persisted pattern findings
findings = cpg.get_pattern_findings(severity="high")
stats = cpg.get_pattern_statistics()
Best Practices
Writing Effective Questions
Good questions:
- “What functions handle memory allocation in the buffer manager?”
- “Show the call path from parser to executor”
- “Find unsanitized inputs that reach database queries”
Less effective:
- “Tell me about the code” (too vague)
- “Fix this bug” (action request, not analysis)
- “Everything about transactions” (too broad)
Optimizing Performance
- Be specific - Narrow questions get faster answers
- Use structural queries - When you know the pattern
- Enable caching - For repeated similar queries
- Limit scope - Add file or subsystem constraints
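This guide does not say how caching is enabled inside CodeGraph. One client-side option, sketched below under that caveat, is to memoize calls so that a repeated question reuses the first result. `make_cached_runner` is a hypothetical helper, and `fake_run` stands in for `copilot.run` so the sketch runs without CodeGraph installed.

```python
from functools import lru_cache
from typing import Any, Callable, Dict, List


def make_cached_runner(
    run: Callable[[str], Dict[str, Any]], maxsize: int = 128
) -> Callable[[str], Dict[str, Any]]:
    """Wrap a run(question) callable so repeated questions reuse the first result."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_question: str) -> Dict[str, Any]:
        return run(normalized_question)

    def runner(question: str) -> Dict[str, Any]:
        # Collapse whitespace so trivially different spacing shares a cache entry.
        # Case is preserved because identifiers such as heap_insert are case-sensitive.
        return cached(" ".join(question.split()))

    return runner


# Stand-in for copilot.run; records each call that reaches the backend.
calls: List[str] = []


def fake_run(question: str) -> Dict[str, Any]:
    calls.append(question)
    return {"answer": f"analysis of: {question}"}


ask = make_cached_runner(fake_run)
ask("Find buffer overflow risks")
ask("Find  buffer overflow   risks")  # same question, different spacing
print(len(calls))  # 1
```

Cached calls return the same dictionary object, so callers should treat results as read-only.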
Interpreting Answers
- Check evidence - Verify the code references in result['evidence']
- Consider confidence - It measures intent classification, not answer quality; a low value suggests the wrong scenario may have run, so verify manually
- Follow up - Ask clarifying questions
- Cross-reference - Compare with actual code
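To cross-reference evidence against the source tree, the file and line can be pulled out of each entry. The sketch below assumes entries begin with `<file>:<line>`, as in the sample result shown earlier; real entries may differ, and `parse_evidence` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class EvidenceRef(NamedTuple):
    path: str
    line: int


# Assumption: evidence entries start with "<file>:<line>", as in the sample result.
_EVIDENCE_PREFIX = re.compile(r"^\s*(?P<path>[^\s:]+):(?P<line>\d+)")


def parse_evidence(entry: str) -> Optional[EvidenceRef]:
    """Extract the leading file:line reference, or None if the entry has none."""
    match = _EVIDENCE_PREFIX.match(entry)
    if match is None:
        return None
    return EvidenceRef(match.group("path"), int(match.group("line")))


print(parse_evidence("xact.c:1234 CommitTransaction calls..."))
print(parse_evidence("no location in this entry"))
```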
Workflow Integration
CI/CD Integration
# .github/workflows/code-analysis.yml
- name: Run Code Analysis
  run: |
    python -c "
    from src.workflow import MultiScenarioCopilot
    copilot = MultiScenarioCopilot()
    result = copilot.run('Find potential security issues')
    if result.get('error'):
        print(f'Analysis error: {result[\"error\"]}')
        exit(1)
    print(result['answer'])
    "
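The inline command needs shell-level quote escaping, which gets harder to maintain as the check grows. A minimal sketch of the same gate as a reusable function follows; `ci_exit_code` is a hypothetical name, and a literal dictionary stands in for the value returned by `copilot.run(...)`.

```python
import sys
from typing import Any, Mapping


def ci_exit_code(result: Mapping[str, Any]) -> int:
    """Return 1 if the analysis reported an error, 0 otherwise."""
    if result.get("error"):
        print(f"Analysis error: {result['error']}", file=sys.stderr)
        return 1
    print(result.get("answer", ""))
    return 0


# Stand-in for the result of MultiScenarioCopilot().run(...), illustrative only.
demo_result = {"answer": "No issues found.", "error": None}
print(ci_exit_code(demo_result))
```

In a script run by the workflow step, the return value would be passed to `sys.exit()` so that a reported error fails the job.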
Code Review
# Run automated patch review demo
python examples/demo_patch_review.py --db data/projects/myproject.duckdb
# Available flags:
# --db PATH Path to DuckDB CPG database
# --no-dod Disable Definition of Done functionality
# --auto-dod Auto-generate DoD instead of extracting from PR body
# --interactive Enable interactive DoD confirmation prompts
Documentation Generation
from src.workflow import MultiScenarioCopilot
copilot = MultiScenarioCopilot()
result = copilot.run("Document the transaction subsystem")
# result['answer'] contains generated documentation
print(result['answer'])
Next Steps
- Scenarios - All 21 analysis scenarios
- CLI Guide - Command-line interface
- API Reference - Programmatic access
- Troubleshooting - Common issues