CPG Export Guide

This guide covers creating and exporting Code Property Graphs (CPG) to DuckDB for analysis with CodeGraph.

Note: For regular CodeGraph operation, Joern is not required. CPG data is typically pre-exported to DuckDB. This guide is for users who need to create new CPG exports from source code.

Overview

The CPG export system creates code analysis data in DuckDB format, enabling:

  • SQL queries for graph traversal
  • Semantic search with vector embeddings
  • Security analysis with hypothesis generation
  • Incremental updates via git integration

Key Features:

  • Full CPG Spec v1.1 compliance (22 node types, 20 edge types)
  • Checkpoint/resume for large codebases
  • Automatic validation
  • Property Graph creation for graph queries


Prerequisites

For New CPG Creation (Optional)

  • Joern installation (only if creating new CPG from source code)
  • DuckDB 0.9.0+
  • Python 3.10+

For Using Existing CPG Data

  • DuckDB 0.9.0+
  • Python 3.10+
  • Pre-exported CPG database file (.duckdb)

Optional

  • DuckPGQ extension for property graph queries
  • sentence-transformers for semantic embeddings

Install Dependencies

pip install duckdb cpgqls-client sentence-transformers

Quick Start

# Full export with validation
python -m src.cpg_export.exporter \
    --endpoint localhost:8080 \
    --workspace myproject.cpg \
    --db cpg.duckdb

# Check export status
python -m src.cpg_export.exporter --db cpg.duckdb --status

# Validate existing database
python -m src.cpg_export.exporter --db cpg.duckdb --validate-only

Python API

from src.cpg_export import JoernToDuckDBExporter

# Create exporter
exporter = JoernToDuckDBExporter(
    server_endpoint="localhost:8080",
    workspace="myproject.cpg",
    db_path="cpg.duckdb",
    batch_size=10000
)

# Full export with automatic validation
results = exporter.export_full_cpg()

# Check results
print(f"Nodes exported: {sum(results['node_stats'].values())}")
print(f"Edges exported: {sum(results['edge_stats'].values())}")

CLI Reference

Parameters

Parameter          Default         Description
--endpoint         localhost:8080  Joern server endpoint
--workspace        pg17_full.cpg   Workspace/CPG name in Joern
--db               cpg.duckdb      Output DuckDB file path
--batch-size       10000           Records per batch (memory vs. speed)
--limit            None            Limit records per type (for testing)
--force            False           Drop and recreate all tables
--no-resume        False           Disable checkpoint resume
--skip-validation  False           Skip validation at the end
--status           False           Show export progress only
--validate-only    False           Run validation only

Common Commands

# Resume interrupted export
python -m src.cpg_export.exporter --db cpg.duckdb

# Force fresh export (drops existing data)
python -m src.cpg_export.exporter --db cpg.duckdb --force

# Test with limited data
python -m src.cpg_export.exporter --db cpg.duckdb --limit 1000

# Large codebase with smaller batches
python -m src.cpg_export.exporter --db cpg.duckdb --batch-size 5000

Export Process

The exporter runs a 5-step pipeline:

Step 1: Schema Initialization

Creates tables for all CPG node and edge types. Tables are created with IF NOT EXISTS to preserve existing data unless --force is used.
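As an illustration, the per-type tables follow the shape of the Schema Reference below. This is a simplified sketch, not the exporter's exact DDL (the real tables carry more columns):

```sql
-- Simplified sketch; see the Schema Reference for the full column sets
CREATE TABLE IF NOT EXISTS nodes_method (
    id          BIGINT PRIMARY KEY,
    name        VARCHAR,
    full_name   VARCHAR,
    filename    VARCHAR,
    line_number INTEGER
);

CREATE TABLE IF NOT EXISTS edges_ast (
    src BIGINT,
    dst BIGINT
);
```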

Step 2: Node Export

Exports all CPG nodes by type:

Priority        Node Types
P0 (Core)       METHOD, CALL, IDENTIFIER, LITERAL, LOCAL, PARAM, RETURN, BLOCK, CONTROL_STRUCTURE
P0 (Structure)  FILE, NAMESPACE, NAMESPACE_BLOCK, MEMBER, TYPE, TYPE_DECL
P1              METHOD_PARAMETER_OUT, METHOD_RETURN, FIELD_IDENTIFIER, TYPE_ARGUMENT, TYPE_PARAMETER
P2              JUMP_LABEL, JUMP_TARGET, METHOD_REF, MODIFIER, TYPE_REF, UNKNOWN
P3              BINDING, ANNOTATION
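Internally, each type is pulled from Joern in batches of --batch-size records, advancing an offset until an empty batch signals the end. A minimal sketch of that loop (fetch and write are hypothetical stand-ins for the Joern query and the DuckDB insert):

```python
def export_in_batches(fetch, write, batch_size=10_000):
    """Pull records in fixed-size batches until a batch comes back empty.

    fetch(offset, limit) -> list of records (stand-in for the Joern query)
    write(records)       -> persists records (stand-in for the DuckDB insert)
    Returns the total number of records exported.
    """
    offset = 0
    while True:
        batch = fetch(offset, batch_size)
        if not batch:
            break
        write(batch)
        offset += len(batch)  # a checkpoint of this offset enables resume
    return offset
```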

Step 3: Edge Export

Exports all CPG edges:

Priority       Edge Types
P0 (Core)      AST, CFG, CALL, REF, ARGUMENT, RECEIVER, CONDITION
P0 (Analysis)  REACHING_DEF, DOMINATE, POST_DOMINATE, CDG, CONTAINS
P1             EVAL_TYPE, INHERITS_FROM, ALIAS_OF
P2             BINDS_TO, PARAMETER_LINK, SOURCE_FILE
P3             TAGGED_BY, BINDS

Step 4: Property Graph Creation

Creates a DuckDB Property Graph named cpg for graph traversal queries:

-- Query using property graph
FROM GRAPH_TABLE(cpg
    MATCH (m:METHOD)-[c:CALLS]->(callee:METHOD)
    WHERE m.name = 'main'
    COLUMNS (m.name AS caller, callee.full_name AS callee)
)
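For reference, a DuckPGQ property graph over these tables can be declared along these lines. This is a minimal sketch with a single vertex and edge table; the exporter registers the full set of tables:

```sql
INSTALL duckpgq FROM community;
LOAD duckpgq;

-- Sketch only: the real graph covers all node and edge tables
CREATE PROPERTY GRAPH cpg
  VERTEX TABLES (
    nodes_method LABEL METHOD
  )
  EDGE TABLES (
    edges_call
      SOURCE KEY (src) REFERENCES nodes_method (id)
      DESTINATION KEY (dst) REFERENCES nodes_method (id)
      LABEL CALLS
  );
```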

Step 5: Validation

Compares counts between Joern and DuckDB to ensure complete export:

======================================================================
CPG EXPORT VALIDATION REPORT
======================================================================
[OK]       nodes_method                      1234 /     1234 (100.0%)
[OK]       nodes_call                       45678 /    45678 (100.0%)
[MISSING]  nodes_identifier                 89000 /    89012 ( 99.9%)
----------------------------------------------------------------------
TOTAL                                      134912 /   134924 ( 99.9%)
======================================================================
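The per-row comparison behind this report is straightforward; a sketch (validation_line is a hypothetical helper, not the module's API):

```python
def validation_line(name, duckdb_count, joern_count):
    """Format one report row comparing DuckDB vs. Joern record counts."""
    pct = 100.0 * duckdb_count / joern_count if joern_count else 100.0
    status = "OK" if duckdb_count >= joern_count else "MISSING"
    return f"[{status:9s}] {name:25s} {duckdb_count:8d} / {joern_count:8d} ({pct:5.1f}%)"
```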

Checkpoint/Resume

The exporter automatically saves progress after each batch. If interrupted:

# Simply run again - will resume automatically
python -m src.cpg_export.exporter --db cpg.duckdb

Check Progress

# CLI
python -m src.cpg_export.exporter --db cpg.duckdb --status

-- Direct SQL query
SELECT entity_type, status, exported_count, last_offset
FROM export_progress
ORDER BY entity_type;

Progress States

Status       Description
pending      Not yet started
in_progress  Currently exporting
completed    Successfully finished
failed       Error occurred
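The resume decision maps directly onto these states; a sketch (resume_offset is an illustrative helper, with progress as a dict keyed by entity type):

```python
def resume_offset(progress, entity_type):
    """Decide where to restart an entity's export from saved progress.

    progress maps entity_type -> {"status": ..., "last_offset": ...}.
    Returns None when nothing remains to do, otherwise a batch offset.
    """
    row = progress.get(entity_type)
    if row is None or row["status"] in ("pending", "failed"):
        return 0               # start (or restart) from the beginning
    if row["status"] == "completed":
        return None            # already done, skip
    return row["last_offset"]  # in_progress: pick up at the last checkpoint
```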

Validation

Automatic Validation

Validation runs automatically at the end of export. To skip:

python -m src.cpg_export.exporter --db cpg.duckdb --skip-validation

Manual Validation

# CLI
python -m src.cpg_export.exporter --db cpg.duckdb --validate-only

# Python API
from src.cpg_export import validate_export
from src.execution.joern_client import JoernClient
import duckdb

joern = JoernClient("localhost:8080", "myproject.cpg")
conn = duckdb.connect("cpg.duckdb")

results = validate_export(joern, conn, print_report=True)

Handling Missing Data

If validation shows missing records:

  1. Check Joern logs for parse errors
  2. Re-export specific types if partial failure
  3. Force recreate if data corruption suspected
# Re-export only nodes
exporter = JoernToDuckDBExporter(...)
exporter.connect_db()
node_stats = exporter.export_nodes_only(limit=None)

# Re-export only edges
edge_stats = exporter.export_edges_only(limit=None)

Incremental Updates

For repositories with active development, use incremental export:

from src.cpg_export.incremental_exporter import IncrementalCPGExporter

exporter = IncrementalCPGExporter(
    repo_path="/path/to/repo",
    db_path="cpg.duckdb",
    joern_path="/path/to/joern"
)

# Update from git changes
result = exporter.update_from_git_diff(
    from_ref="HEAD~5",  # Last 5 commits
    to_ref="HEAD"
)

print(f"Status: {result.status}")
print(f"Files changed: {len(result.changed_files)}")
print(f"Nodes updated: {result.nodes_updated}")
print(f"Duration: {result.duration_seconds}s")

Performance

Codebase Size  Full Export  Incremental
100K LOC       ~20 min      ~2 min
1M LOC         ~3 hours     ~10 min

Vector Embeddings

Add semantic embeddings for code search:

from src.cpg_export.add_vector_embeddings import add_embeddings_to_methods

# Add embeddings to methods table
add_embeddings_to_methods(
    db_path="cpg.duckdb",
    model_name="all-MiniLM-L6-v2",
    batch_size=100
)

# Semantic search
from src.cpg_export.add_vector_embeddings import find_similar_methods

results = find_similar_methods(
    db_path="cpg.duckdb",
    query="parse user input safely",
    top_k=10
)
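Under the hood, this kind of search reduces to ranking stored embeddings by cosine similarity against the query embedding. A self-contained sketch of that ranking step (the function names here are illustrative, not the module's API, and the real implementation may push this into SQL):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_methods(query_vec, rows, top_k=10):
    """rows: (method_name, embedding) pairs; returns the top_k best matches."""
    scored = [(name, cosine_similarity(query_vec, emb)) for name, emb in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```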

Schema Reference

Core Node Tables

Table                    Key Columns                                        Description
nodes_method             id, name, full_name, filename, line_number         Functions/methods
nodes_call               id, name, method_full_name, filename, line_number  Call sites
nodes_identifier         id, name, type_full_name, line_number              Variable references
nodes_literal            id, code, type_full_name                           Literal values
nodes_local              id, name, type_full_name                           Local variables
nodes_param              id, name, type_full_name, index                    Parameters
nodes_return             id, code, line_number                              Return statements
nodes_block              id, type_full_name, line_number                    Code blocks
nodes_control_structure  id, control_structure_type, line_number            if/for/while
nodes_type_decl          id, name, full_name, filename                      Type declarations
nodes_file               id, name, hash                                     Source files

Core Edge Tables

Table               Columns             Description
edges_ast           src, dst            AST parent-child
edges_cfg           src, dst            Control flow
edges_call          src, dst            Method calls
edges_ref           src, dst            Variable references
edges_reaching_def  src, dst, variable  Data flow
edges_cdg           src, dst            Control dependence
edges_dominate      src, dst            Dominance
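Edge tables join back to node tables by id. For example, assuming CALL edges run from call-site nodes to callee methods, callees can be resolved like this (a sketch; verify the edge direction against your export):

```sql
-- Resolve each call site to the method it invokes
SELECT c.code AS call_site, m.full_name AS callee
FROM edges_call e
JOIN nodes_call   c ON c.id = e.src
JOIN nodes_method m ON m.id = e.dst;
```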

Full Schema

See src/cpg_export/duckdb_cpg_schema.md for complete schema documentation.


Querying the CPG

SQL Queries

-- Find all methods in a file
SELECT name, full_name, line_number
FROM nodes_method
WHERE filename LIKE '%auth%'
ORDER BY line_number;

-- Find calls to dangerous functions
SELECT nc.code, nc.filename, nc.line_number
FROM nodes_call nc
WHERE nc.name IN ('system', 'exec', 'eval');

-- Count nodes by type
SELECT 'METHOD' as type, COUNT(*) as cnt FROM nodes_method
UNION ALL
SELECT 'CALL', COUNT(*) FROM nodes_call
UNION ALL
SELECT 'IDENTIFIER', COUNT(*) FROM nodes_identifier;

Property Graph Queries (DuckPGQ)

-- Find call chains
FROM GRAPH_TABLE(cpg
    MATCH (caller:METHOD)-[c:CALLS]->(callee:METHOD)
    WHERE caller.name = 'process_input'
    COLUMNS (
        caller.full_name AS caller,
        callee.full_name AS callee
    )
)
LIMIT 100;

-- Find data flow paths
FROM GRAPH_TABLE(cpg
    MATCH (src:IDENTIFIER)-[:REACHING_DEF*1..5]->(sink:CALL)
    WHERE sink.name = 'execute'
    COLUMNS (
        src.name AS source_var,
        sink.code AS sink_call,
        sink.line_number AS line
    )
)

Python Client

from src.cpg_export.duckdb_cpg_client_v2 import DuckDBCPGClient

client = DuckDBCPGClient("cpg.duckdb")

# Find methods by pattern
methods = client.find_methods_by_name("parse%")

# Get call graph for a method
callgraph = client.get_callgraph("UserController::authenticate")

# Get statistics
stats = client.get_stats()
print(f"Total methods: {stats['nodes_method']}")
print(f"Total calls: {stats['nodes_call']}")

Troubleshooting

Connection Errors

Error: Could not connect to Joern server

Solution: Verify Joern is running and accessible:

# Test connection
curl http://localhost:8080/result

Out of Memory

Error: Database out of memory

Solutions:

  1. Reduce the batch size: --batch-size 5000
  2. Use --limit for testing
  3. Close other applications

Slow Export

For large codebases (>1M LOC):

  1. Use incremental export for updates
  2. Run overnight for initial export
  3. Consider splitting by directory

Missing Nodes

If validation shows missing nodes:

  1. Check Joern parse logs for errors
  2. Verify workspace is correctly opened
  3. Try force recreate: --force

Performance Tips

Codebase Size  Batch Size  Estimated Time
<50K LOC       10000       <5 min
50K-200K LOC   10000       5-30 min
200K-1M LOC    5000        30 min - 3 hours
>1M LOC        2000        3+ hours
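The batch-size guidance above can be captured as a small helper (a heuristic mirroring the table, not part of the package):

```python
def suggest_batch_size(loc):
    """Suggest a --batch-size value for a codebase of `loc` lines of code."""
    if loc < 200_000:
        return 10_000   # small/medium codebases: favor speed
    if loc <= 1_000_000:
        return 5_000    # large codebases: balance memory and speed
    return 2_000        # very large codebases: favor low memory use
```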

Optimizations:

  • SSD storage for the database
  • Adequate RAM (8GB+ for large codebases)
  • Run validation separately if needed
  • Use incremental exports for updates


See Also