CPG Export Guide¶
This guide covers creating and exporting Code Property Graphs (CPGs) to DuckDB for analysis with CodeGraph.
Note: For regular CodeGraph operation, Joern is not required. CPG data is typically pre-exported to DuckDB. This guide is for users who need to create new CPG exports from source code.
Table of Contents¶
- Overview
- Prerequisites
  - For New CPG Creation (Optional)
  - For Using Existing CPG Data
  - Optional
  - Install Dependencies
- Quick Start
  - CLI Export (Recommended)
  - Python API
- CLI Reference
  - Parameters
  - Common Commands
- Export Process
  - Step 1: Schema Initialization
  - Step 2: Node Export
  - Step 3: Edge Export
  - Step 4: Property Graph Creation
  - Step 5: Validation
- Checkpoint/Resume
  - Check Progress
  - Progress States
- Validation
  - Automatic Validation
  - Manual Validation
  - Handling Missing Data
- Incremental Updates
- Performance
- Vector Embeddings
- Schema Reference
  - Core Node Tables
  - Core Edge Tables
  - Full Schema
- Querying the CPG
  - SQL Queries
  - Property Graph Queries (DuckPGQ)
  - Python Client
- Troubleshooting
  - Connection Errors
  - Out of Memory
  - Slow Export
  - Missing Nodes
- Performance Tips
- See Also
Overview¶
The CPG export system creates code analysis data in DuckDB format, enabling:
- SQL queries for graph traversal
- Semantic search with vector embeddings
- Security analysis with hypothesis generation
- Incremental updates via git integration
Key Features:
- Full CPG Spec v1.1 compliance (22 node types, 20 edge types)
- Checkpoint/resume for large codebases
- Automatic validation
- Property Graph creation for graph queries
Prerequisites¶
For New CPG Creation (Optional)¶
- Joern installation (only if creating new CPG from source code)
- DuckDB 0.9.0+
- Python 3.10+
For Using Existing CPG Data¶
- DuckDB 0.9.0+
- Python 3.10+
- Pre-exported CPG database file (.duckdb)
Optional¶
- DuckPGQ extension for property graph queries
- sentence-transformers for semantic embeddings
Install Dependencies¶
pip install duckdb cpgqls-client sentence-transformers
Quick Start¶
CLI Export (Recommended)¶
# Full export with validation
python -m src.cpg_export.exporter \
--endpoint localhost:8080 \
--workspace myproject.cpg \
--db cpg.duckdb
# Check export status
python -m src.cpg_export.exporter --db cpg.duckdb --status
# Validate existing database
python -m src.cpg_export.exporter --db cpg.duckdb --validate-only
Python API¶
from src.cpg_export import JoernToDuckDBExporter
# Create exporter
exporter = JoernToDuckDBExporter(
    server_endpoint="localhost:8080",
    workspace="myproject.cpg",
    db_path="cpg.duckdb",
    batch_size=10000
)
# Full export with automatic validation
results = exporter.export_full_cpg()
# Check results
print(f"Nodes exported: {sum(results['node_stats'].values())}")
print(f"Edges exported: {sum(results['edge_stats'].values())}")
CLI Reference¶
Parameters¶
| Parameter | Default | Description |
|---|---|---|
| --endpoint | localhost:8080 | Joern server endpoint |
| --workspace | pg17_full.cpg | Workspace/CPG name in Joern |
| --db | cpg.duckdb | Output DuckDB file path |
| --batch-size | 10000 | Records per batch (memory vs speed) |
| --limit | None | Limit records per type (for testing) |
| --force | False | Drop and recreate all tables |
| --no-resume | False | Disable checkpoint resume |
| --skip-validation | False | Skip validation at end |
| --status | False | Show export progress only |
| --validate-only | False | Run validation only |
Common Commands¶
# Resume interrupted export
python -m src.cpg_export.exporter --db cpg.duckdb
# Force fresh export (drops existing data)
python -m src.cpg_export.exporter --db cpg.duckdb --force
# Test with limited data
python -m src.cpg_export.exporter --db cpg.duckdb --limit 1000
# Large codebase with smaller batches
python -m src.cpg_export.exporter --db cpg.duckdb --batch-size 5000
Export Process¶
The exporter runs a 5-step pipeline:
Step 1: Schema Initialization¶
Creates tables for all CPG node and edge types. Tables are created with IF NOT EXISTS to preserve existing data unless --force is used.
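For illustration, the DDL for one node table might look roughly like the following. This is a sketch based on the column list in the Schema Reference below, not the exporter's exact schema:

import duckdb

conn = duckdb.connect("cpg.duckdb")
# Create one node table if it does not exist yet (column types are illustrative)
conn.execute("""
    CREATE TABLE IF NOT EXISTS nodes_method (
        id BIGINT PRIMARY KEY,
        name VARCHAR,
        full_name VARCHAR,
        filename VARCHAR,
        line_number INTEGER
    )
""")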
Step 2: Node Export¶
Exports all CPG nodes by type:
| Priority | Node Types |
|---|---|
| P0 (Core) | METHOD, CALL, IDENTIFIER, LITERAL, LOCAL, PARAM, RETURN, BLOCK, CONTROL_STRUCTURE |
| P0 (Structure) | FILE, NAMESPACE, NAMESPACE_BLOCK, MEMBER, TYPE, TYPE_DECL |
| P1 | METHOD_PARAMETER_OUT, METHOD_RETURN, FIELD_IDENTIFIER, TYPE_ARGUMENT, TYPE_PARAMETER |
| P2 | JUMP_LABEL, JUMP_TARGET, METHOD_REF, MODIFIER, TYPE_REF, UNKNOWN |
| P3 | BINDING, ANNOTATION |
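Internally this follows a simple fetch-and-insert loop per node type. A minimal sketch of the pattern, using a hypothetical fetch_method_batch helper in place of the exporter's Joern-side queries (column names follow the Schema Reference below):

import duckdb

def fetch_method_batch(offset, batch_size):
    """Hypothetical helper: returns one batch of METHOD nodes from Joern as dicts."""
    ...

conn = duckdb.connect("cpg.duckdb")
offset, batch_size = 0, 10000
while True:
    batch = fetch_method_batch(offset, batch_size)
    if not batch:
        break
    # Bulk-insert the batch into the corresponding node table
    conn.executemany(
        "INSERT INTO nodes_method (id, name, full_name, filename, line_number) "
        "VALUES (?, ?, ?, ?, ?)",
        [(m["id"], m["name"], m["fullName"], m["filename"], m["lineNumber"]) for m in batch],
    )
    offset += batch_size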
Step 3: Edge Export¶
Exports all CPG edges:
| Priority | Edge Types |
|---|---|
| P0 (Core) | AST, CFG, CALL, REF, ARGUMENT, RECEIVER, CONDITION |
| P0 (Analysis) | REACHING_DEF, DOMINATE, POST_DOMINATE, CDG, CONTAINS |
| P1 | EVAL_TYPE, INHERITS_FROM, ALIAS_OF |
| P2 | BINDS_TO, PARAMETER_LINK, SOURCE_FILE |
| P3 | TAGGED_BY, BINDS |
Step 4: Property Graph Creation¶
Creates a DuckDB Property Graph named cpg for graph traversal queries:
-- Query using property graph
FROM GRAPH_TABLE(cpg
MATCH (m:METHOD)-[c:CALLS]->(callee:METHOD)
WHERE m.name = 'main'
COLUMNS (m.name AS caller, callee.full_name AS callee)
)
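The graph itself is registered with DuckPGQ's CREATE PROPERTY GRAPH statement over the node and edge tables. A simplified sketch covering only the method/call subset (the actual export registers many more tables; labels and key columns here are assumptions based on the schema tables below):

import duckdb

conn = duckdb.connect("cpg.duckdb")
# DuckPGQ is a community extension; install/load it before creating the graph
conn.execute("INSTALL duckpgq FROM community")
conn.execute("LOAD duckpgq")
# Register METHOD vertices and CALLS edges as a property graph named cpg
conn.execute("""
    CREATE PROPERTY GRAPH cpg
    VERTEX TABLES (
        nodes_method LABEL METHOD
    )
    EDGE TABLES (
        edges_call
            SOURCE KEY (src) REFERENCES nodes_method (id)
            DESTINATION KEY (dst) REFERENCES nodes_method (id)
            LABEL CALLS
    )
""")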
Step 5: Validation¶
Compares counts between Joern and DuckDB to ensure complete export:
======================================================================
CPG EXPORT VALIDATION REPORT
======================================================================
[OK] nodes_method 1234 / 1234 (100.0%)
[OK] nodes_call 45678 / 45678 (100.0%)
[MISSING] nodes_identifier 89000 / 89012 ( 99.9%)
----------------------------------------------------------------------
TOTAL 134912 / 134924 ( 99.9%)
======================================================================
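The check itself is a per-type count comparison between Joern and DuckDB. A rough sketch of the idea using cpgqls-client and duckdb directly (the CPGQL query and output parsing here are illustrative assumptions):

from cpgqls_client import CPGQLSClient
import duckdb

client = CPGQLSClient("localhost:8080")
conn = duckdb.connect("cpg.duckdb")

# Count METHOD nodes on the Joern side (parsing assumes REPL-style "... = N" output)
stdout = client.execute("cpg.method.size")["stdout"]
joern_count = int(stdout.split("=")[-1].strip())

# Count the corresponding rows on the DuckDB side
duckdb_count = conn.execute("SELECT COUNT(*) FROM nodes_method").fetchone()[0]
print(f"nodes_method {duckdb_count} / {joern_count}")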
Checkpoint/Resume¶
The exporter automatically saves progress after each batch. If interrupted:
# Simply run again - will resume automatically
python -m src.cpg_export.exporter --db cpg.duckdb
Check Progress¶
# CLI
python -m src.cpg_export.exporter --db cpg.duckdb --status
-- Direct SQL query
SELECT entity_type, status, exported_count, last_offset
FROM export_progress
ORDER BY entity_type;
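Resume works off the same table: on startup, the exporter can read last_offset for each entity type and pick up from there. A rough sketch of that lookup (illustrative only):

import duckdb

conn = duckdb.connect("cpg.duckdb")
# Read the checkpoint for one entity type; the real exporter does this per node/edge type
row = conn.execute(
    "SELECT last_offset FROM export_progress "
    "WHERE entity_type = ? AND status = 'in_progress'",
    ["nodes_method"],
).fetchone()
offset = row[0] if row else 0
print(f"Resuming nodes_method export at offset {offset}")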
Progress States¶
| Status | Description |
|---|---|
| pending | Not yet started |
| in_progress | Currently exporting |
| completed | Successfully finished |
| failed | Error occurred |
Validation¶
Automatic Validation¶
Validation runs automatically at the end of export. To skip:
python -m src.cpg_export.exporter --db cpg.duckdb --skip-validation
Manual Validation¶
# CLI
python -m src.cpg_export.exporter --db cpg.duckdb --validate-only
# Python API
from src.cpg_export import validate_export
from src.execution.joern_client import JoernClient
import duckdb
joern = JoernClient("localhost:8080", "myproject.cpg")
conn = duckdb.connect("cpg.duckdb")
results = validate_export(joern, conn, print_report=True)
Handling Missing Data¶
If validation shows missing records:
- Check Joern logs for parse errors
- Re-export specific types if partial failure
- Force recreate if data corruption suspected
# Re-export only nodes
exporter = JoernToDuckDBExporter(...)
exporter.connect_db()
node_stats = exporter.export_nodes_only(limit=None)
# Re-export only edges
edge_stats = exporter.export_edges_only(limit=None)
Incremental Updates¶
For repositories with active development, use incremental export:
from src.cpg_export.incremental_exporter import IncrementalCPGExporter
exporter = IncrementalCPGExporter(
    repo_path="/path/to/repo",
    db_path="cpg.duckdb",
    joern_path="/path/to/joern"
)
# Update from git changes
result = exporter.update_from_git_diff(
    from_ref="HEAD~5",  # Last 5 commits
    to_ref="HEAD"
)
print(f"Status: {result.status}")
print(f"Files changed: {len(result.changed_files)}")
print(f"Nodes updated: {result.nodes_updated}")
print(f"Duration: {result.duration_seconds}s")
Performance¶
| Codebase Size | Full Export | Incremental |
|---|---|---|
| 100K LOC | ~20 min | ~2 min |
| 1M LOC | ~3 hours | ~10 min |
Vector Embeddings¶
Add semantic embeddings for code search:
from src.cpg_export.add_vector_embeddings import add_embeddings_to_methods
# Add embeddings to methods table
add_embeddings_to_methods(
    db_path="cpg.duckdb",
    model_name="all-MiniLM-L6-v2",
    batch_size=100
)
# Semantic search
from src.cpg_export.add_vector_embeddings import find_similar_methods
results = find_similar_methods(
    db_path="cpg.duckdb",
    query="parse user input safely",
    top_k=10
)
Schema Reference¶
Core Node Tables¶
| Table | Key Columns | Description |
|---|---|---|
| nodes_method | id, name, full_name, filename, line_number | Functions/methods |
| nodes_call | id, name, method_full_name, filename, line_number | Call sites |
| nodes_identifier | id, name, type_full_name, line_number | Variable references |
| nodes_literal | id, code, type_full_name | Literal values |
| nodes_local | id, name, type_full_name | Local variables |
| nodes_param | id, name, type_full_name, index | Parameters |
| nodes_return | id, code, line_number | Return statements |
| nodes_block | id, type_full_name, line_number | Code blocks |
| nodes_control_structure | id, control_structure_type, line_number | if/for/while |
| nodes_type_decl | id, name, full_name, filename | Type declarations |
| nodes_file | id, name, hash | Source files |
Core Edge Tables¶
| Table | Columns | Description |
|---|---|---|
| edges_ast | src, dst | AST parent-child |
| edges_cfg | src, dst | Control flow |
| edges_call | src, dst | Method calls |
| edges_ref | src, dst | Variable references |
| edges_reaching_def | src, dst, variable | Data flow |
| edges_cdg | src, dst | Control dependence |
| edges_dominate | src, dst | Dominance |
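Because the edge tables only carry node ids, most analysis joins them back to the node tables. For example, caller/callee pairs can be pulled from edges_call like this (a sketch, assuming edges_call links nodes_method ids as in the property graph example above):

import duckdb

conn = duckdb.connect("cpg.duckdb")
# Resolve caller/callee names by joining the edge table back to nodes_method
rows = conn.execute("""
    SELECT caller.full_name, callee.full_name
    FROM edges_call e
    JOIN nodes_method caller ON caller.id = e.src
    JOIN nodes_method callee ON callee.id = e.dst
    LIMIT 20
""").fetchall()
for caller_name, callee_name in rows:
    print(f"{caller_name} -> {callee_name}")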
Full Schema¶
See src/cpg_export/duckdb_cpg_schema.md for complete schema documentation.
Querying the CPG¶
SQL Queries¶
-- Find all methods in a file
SELECT name, full_name, line_number
FROM nodes_method
WHERE filename LIKE '%auth%'
ORDER BY line_number;
-- Find calls to dangerous functions
SELECT nc.code, nc.filename, nc.line_number
FROM nodes_call nc
WHERE nc.name IN ('system', 'exec', 'eval');
-- Count nodes by type
SELECT 'METHOD' as type, COUNT(*) as cnt FROM nodes_method
UNION ALL
SELECT 'CALL', COUNT(*) FROM nodes_call
UNION ALL
SELECT 'IDENTIFIER', COUNT(*) FROM nodes_identifier;
Property Graph Queries (DuckPGQ)¶
-- Find call chains
FROM GRAPH_TABLE(cpg
MATCH (caller:METHOD)-[c:CALLS]->(callee:METHOD)
WHERE caller.name = 'process_input'
COLUMNS (
caller.full_name AS caller,
callee.full_name AS callee
)
)
LIMIT 100;
-- Find data flow paths
FROM GRAPH_TABLE(cpg
MATCH (src:IDENTIFIER)-[:REACHING_DEF*1..5]->(sink:CALL)
WHERE sink.name = 'execute'
COLUMNS (
src.name AS source_var,
sink.code AS sink_call,
sink.line_number AS line
)
)
Python Client¶
from src.cpg_export.duckdb_cpg_client_v2 import DuckDBCPGClient
client = DuckDBCPGClient("cpg.duckdb")
# Find methods by pattern
methods = client.find_methods_by_name("parse%")
# Get call graph for a method
callgraph = client.get_callgraph("UserController::authenticate")
# Get statistics
stats = client.get_stats()
print(f"Total methods: {stats['nodes_method']}")
print(f"Total calls: {stats['nodes_call']}")
Troubleshooting¶
Connection Errors¶
Error: Could not connect to Joern server
Solution: Verify Joern is running and accessible:
# Test connection
curl http://localhost:8080/result
Out of Memory¶
Error: Database out of memory
Solutions:
1. Reduce batch size: --batch-size 5000
2. Use --limit for testing
3. Close other applications
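If the failure is on the DuckDB side, it can also help to cap DuckDB's memory explicitly and give it a spill directory; these are standard DuckDB settings rather than exporter options:

import duckdb

conn = duckdb.connect("cpg.duckdb")
# Standard DuckDB settings: cap memory usage and allow spilling to disk
conn.execute("SET memory_limit = '4GB'")
conn.execute("SET temp_directory = '/tmp/duckdb_spill'")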
Slow Export¶
For large codebases (>1M LOC):
- Use incremental export for updates
- Run overnight for initial export
- Consider splitting by directory
Missing Nodes¶
If validation shows missing nodes:
- Check Joern parse logs for errors
- Verify workspace is correctly opened
- Try a force recreate with --force
Performance Tips¶
| Codebase Size | Batch Size | Estimated Time |
|---|---|---|
| <50K LOC | 10000 | <5 min |
| 50K-200K LOC | 10000 | 5-30 min |
| 200K-1M LOC | 5000 | 30 min - 3 hours |
| >1M LOC | 2000 | 3+ hours |
Optimizations:
- SSD storage for the database
- Adequate RAM (8GB+ for large codebases)
- Run validation separately if needed
- Use incremental exports for updates
See Also¶
- SQL Query Cookbook - Example queries
- Schema Reference - Database schema reference
- Joern Documentation
- DuckDB Documentation
- CPG Specification v1.1