DuckDB CPG Schema Design (CPG Spec v1.1)¶
Table of Contents¶
- Overview
- Core Design Principles
- Node Tables
- nodes_method
- nodes_call
- nodes_identifier
- nodes_field_identifier
- nodes_literal
- nodes_local
- nodes_param
- nodes_method_parameter_out
- nodes_method_return
- nodes_return
- nodes_block
- nodes_control_structure
- nodes_member
- nodes_type_decl
- nodes_metadata
- nodes_file
- nodes_namespace_block
- nodes_method_ref
- nodes_type_ref
- nodes_unknown
- nodes_jump_target
- nodes_type_param
- nodes_type_argument
- nodes_binding
- nodes_closure_binding
- nodes_comment
- nodes_modifier
- nodes_annotation
- nodes_type
- nodes_jump_label
- nodes_tag_v2
- nodes_finding
- nodes_macro
- nodes_macro_param
- nodes_namespace
- nodes_annotation_literal
- nodes_annotation_parameter
- nodes_annotation_parameter_assign
- nodes_key_value_pair
- nodes_location
- nodes_config_file
- nodes_import
- nodes_collection_decl
- Edge Tables
- edges_ast
- edges_cfg
- edges_call
- edges_ref
- edges_reaching_def
- edges_argument
- edges_receiver
- edges_condition
- edges_dominate
- edges_post_dominate
- edges_cdg
- edges_binds
- edges_binds_to
- edges_source_file
- edges_alias_of
- edges_inherits_from
- edges_contains
- edges_eval_type
- edges_tagged_by
- edges_parameter_link
- edges_pdg
- edges_ddg
- edges_capture
- edges_vtable
- edges_documents
- Views
- call_containment
- method_docstrings
- State Tables
- Domain and Pattern Tables
- Property Graph Definition
- Step 1: Create Materialized Unified Nodes Table
- Step 2: Create Comprehensive Property Graph
- Example Queries
- Standard SQL Query: Find all calls to a specific method
- DuckDB PGQ Query: Find direct call chains (caller -> callee)
- DuckDB PGQ Query: Find methods and their AST children
- DuckDB PGQ Query: Data flow paths using REACHING_DEF
- DuckDB PGQ Query: CFG paths (control flow)
- DuckDB PGQ Query: Find all identifiers and their references
- DuckDB PGQ Query: Type hierarchy (inheritance)
- Combined Query: Methods with most incoming calls
- Performance Considerations
- Schema Version
- Changelog
- v7.0 (2026-02-28) - Full schema.go alignment
- v6.0 (2026-02-26) - Pre-computed Metrics & Pattern Flags
- v5.0 (2025-11-16) - Complete Compliance
- v4.0 (2025-11-16) - Namespace and File Support
- v3.0 (2025-11-16) - OOP Support
- v2.0 (2025-11-16) - Critical Updates
- v1.0 (2025-11-15) - Initial Release
- Extension: Semantic Tag System
- Overview
- nodes_tag_v2
- edges_tagged_by (Extension)
- Tag Categories
- Example Tag Queries
- Tag Statistics
- Integration Notes
Overview¶
This schema implements the Code Property Graph specification v1.1 in DuckDB using the duckpgq extension for efficient property graph queries.
Core Design Principles¶
- Node Tables: Separate tables for each major node type (METHOD, CALL, IDENTIFIER, etc.)
- Edge Tables: Separate tables for each edge type (AST, CFG, CALL, REF, REACHING_DEF, etc.)
- Property Graph: Use duckpgq’s CREATE PROPERTY GRAPH for unified graph queries
- Efficient Indexing: B-tree indexes on id, full_name, and frequently queried properties
- Batch Processing: Support for large-scale CPG imports (50K+ methods)
GoCPG vs Legacy Schema¶
GoCPG generates DuckDB natively and produces additional tables not present in legacy Joern exports:
| GoCPG-only table | Purpose |
|---|---|
| nodes_import | Import/include statements with resolved paths |
| nodes_finding | Static analysis findings (security, quality) |
| nodes_macro | Preprocessor macros (C/C++) |
| edges_ddg | Data Dependence Graph edges |
| edges_pdg | Program Dependence Graph edges |
| edges_contains | Containment edges (method→node) |
| edges_parameter_link | Parameter to argument links |
| edges_eval_type | Type evaluation edges |
GoCPG also pre-computes cyclomatic_complexity on nodes_method and method_id on nodes_param.
Note on cpg_nodes: The materialized cpg_nodes table (UNION ALL of all node tables) is used only for the DuckPGQ Property Graph definition. It should not be queried directly in application code; use the specific nodes_* tables instead. GoCPG databases do not generate cpg_nodes; the compatibility layer creates it on demand when Property Graph queries are needed.
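The on-demand materialization described in the note can be sketched as follows. This is a minimal illustration, not the compatibility layer's actual code: it uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra packages, the function name build_cpg_nodes and the label column are assumptions, and only three node tables with trimmed columns are included.

```python
import sqlite3


def build_cpg_nodes(conn: sqlite3.Connection) -> None:
    """Materialize a unified node table from per-type tables (hypothetical layout)."""
    conn.execute("DROP TABLE IF EXISTS cpg_nodes")
    conn.execute(
        """
        CREATE TABLE cpg_nodes AS
        SELECT id, 'METHOD' AS label, name FROM nodes_method
        UNION ALL
        SELECT id, 'CALL' AS label, name FROM nodes_call
        UNION ALL
        SELECT id, 'IDENTIFIER' AS label, name FROM nodes_identifier
        """
    )


# Trimmed-down node tables with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_method (id BIGINT NOT NULL, name VARCHAR NOT NULL);
    CREATE TABLE nodes_call (id BIGINT NOT NULL, name VARCHAR);
    CREATE TABLE nodes_identifier (id BIGINT NOT NULL, name VARCHAR NOT NULL);
    INSERT INTO nodes_method VALUES (1, 'main');
    INSERT INTO nodes_call VALUES (2, 'printf');
    INSERT INTO nodes_identifier VALUES (3, 'argc');
    """
)

build_cpg_nodes(conn)
unified = conn.execute(
    "SELECT id, label, name FROM cpg_nodes ORDER BY id"
).fetchall()
print(unified)
```

Because the table is dropped and rebuilt on each call, it can be created only when a Property Graph query is about to run, which matches the on-demand behavior described above.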
Node Tables¶
nodes_method¶
Core table for function/method declarations.
CREATE TABLE nodes_method (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
full_name VARCHAR NOT NULL,
signature VARCHAR,
filename VARCHAR,
line_number INTEGER,
line_number_end INTEGER,
column_number INTEGER,
column_number_end INTEGER,
code VARCHAR,
is_external BOOLEAN DEFAULT FALSE,
ast_parent_type VARCHAR,
ast_parent_full_name VARCHAR,
loc INTEGER,
parameter_count INTEGER,
-- Pre-computed pattern flags (v6.0)
has_disabled_code BOOLEAN DEFAULT FALSE,
has_deprecated BOOLEAN DEFAULT FALSE,
has_todo_fixme BOOLEAN DEFAULT FALSE,
has_debug_code BOOLEAN DEFAULT FALSE,
-- Classification flags (v6.0)
is_test BOOLEAN DEFAULT FALSE,
is_entry_point BOOLEAN DEFAULT FALSE,
is_nested BOOLEAN DEFAULT FALSE,
-- Pre-computed metrics (v6.0)
cyclomatic_complexity INTEGER DEFAULT 0,
fan_in INTEGER DEFAULT 0,
fan_out INTEGER DEFAULT 0,
-- Embedding (populated externally by ChromaDB import)
embedding FLOAT[],
embedding_model VARCHAR,
embedding_updated_at TIMESTAMP,
-- AST hash (for incremental change detection)
ast_hash VARCHAR
);
Properties (from CPG spec):
- FULL_NAME, NAME, SIGNATURE: Method identification
- IS_EXTERNAL: True when the method is referenced but not defined in the analyzed source
- AST_PARENT_FULL_NAME, AST_PARENT_TYPE: Type context
- FILENAME, LINE_NUMBER, COLUMN_NUMBER: Source location
- CODE: Method source code (truncated to 1000 chars)
- LOC: Lines of code
- PARAMETER_COUNT: Number of parameters
- AST_HASH: SHA256 hash of AST structure (for incremental change detection)
Pre-computed Pattern Flags (v6.0 – GoCPG):
- HAS_DISABLED_CODE: Contains #if 0, if false, or commented-out blocks
- HAS_DEPRECATED: Contains @Deprecated, [[deprecated]], or similar annotations
- HAS_TODO_FIXME: Contains TODO, FIXME, HACK, or XXX comments
- HAS_DEBUG_CODE: Contains debug prints, console.log, or debugging statements
Classification Flags (v6.0 – GoCPG):
- IS_TEST: Method identified as a test function (cross-language detection)
- IS_ENTRY_POINT: Method is a public API entry point (exported, HTTP handler, main, etc.)
- IS_NESTED: Method is defined inside another method (closure/inner function)
Pre-computed Metrics (v6.0 – GoCPG):
- CYCLOMATIC_COMPLEXITY: McCabe cyclomatic complexity metric
- FAN_IN: Number of methods that call this method
- FAN_OUT: Number of methods called by this method
Example queries using the pre-computed flags and metrics:
-- Find complex methods with high fan-out (potential god methods)
SELECT full_name, cyclomatic_complexity, fan_out
FROM nodes_method
WHERE cyclomatic_complexity > 20 AND fan_out > 15
ORDER BY cyclomatic_complexity DESC;
-- Find deprecated methods still being called
SELECT m.full_name, m.fan_in
FROM nodes_method m
WHERE m.has_deprecated = TRUE AND m.fan_in > 0;
-- Count test vs. production methods
SELECT is_test, COUNT(*) as count
FROM nodes_method
GROUP BY is_test;
-- Find public API entry points with high complexity
SELECT full_name, cyclomatic_complexity, fan_in
FROM nodes_method
WHERE is_entry_point = TRUE AND cyclomatic_complexity > 10
ORDER BY cyclomatic_complexity DESC;
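For databases where fan_in and fan_out are not populated, the two metrics can be approximated from nodes_call. The sketch below is one plausible derivation based on the column descriptions in this document (distinct callers and distinct callees per method), not necessarily how GoCPG computes them. It uses sqlite3 as a stand-in for DuckDB, trimmed tables, and invented sample data.

```python
import sqlite3

# fan_in: distinct methods containing a call that resolves to m.
# fan_out: distinct methods that calls inside m resolve to.
FAN_QUERY = """
SELECT m.full_name,
       (SELECT COUNT(DISTINCT c.containing_method_id)
          FROM nodes_call c
         WHERE c.callee_method_id = m.id) AS fan_in,
       (SELECT COUNT(DISTINCT c.callee_method_id)
          FROM nodes_call c
         WHERE c.containing_method_id = m.id) AS fan_out
FROM nodes_method m
ORDER BY m.id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_method (id BIGINT NOT NULL, full_name VARCHAR NOT NULL);
    CREATE TABLE nodes_call (
        id BIGINT NOT NULL,
        containing_method_id BIGINT,
        callee_method_id BIGINT
    );
    INSERT INTO nodes_method VALUES (1, 'main'), (2, 'parse'), (3, 'log');
    -- main calls parse once and log twice; parse calls log once
    INSERT INTO nodes_call VALUES (10, 1, 2), (11, 1, 3), (12, 1, 3), (13, 2, 3);
    """
)

fan_metrics = conn.execute(FAN_QUERY).fetchall()
for full_name, fan_in, fan_out in fan_metrics:
    print(full_name, fan_in, fan_out)
```

Counting distinct method IDs rather than call rows keeps repeated calls to the same callee from inflating the metric, which is consistent with the "number of methods" wording used for FAN_IN and FAN_OUT.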
nodes_call¶
Represents function/method invocations.
CREATE TABLE nodes_call (
id BIGINT NOT NULL,
name VARCHAR,
method_full_name VARCHAR,
signature VARCHAR,
dispatch_type VARCHAR,
code VARCHAR,
line_number INTEGER,
column_number INTEGER,
argument_index INTEGER,
filename VARCHAR,
type_full_name VARCHAR,
containing_method_id BIGINT,
callee_method_id BIGINT,
type_origin VARCHAR DEFAULT '',
type_confidence DOUBLE DEFAULT 0.0,
embedding FLOAT[],
embedding_model VARCHAR,
embedding_updated_at TIMESTAMP
);
Properties (from CPG spec):
- METHOD_FULL_NAME: Target method
- DISPATCH_TYPE: Call mechanism (STATIC_DISPATCH, DYNAMIC_DISPATCH)
- TYPE_FULL_NAME: Return type
- SIGNATURE: Parameter types
- CONTAINING_METHOD_ID: ID of the method containing this call site
- CALLEE_METHOD_ID: ID of the resolved callee method
- TYPE_ORIGIN: Source of type inference (e.g., TypeRecoveryPass)
- TYPE_CONFIDENCE: Confidence score for type inference (0.0-1.0)
nodes_identifier¶
Variable and reference names.
CREATE TABLE nodes_identifier (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
type_full_name VARCHAR,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
Properties (from CPG spec):
- NAME: Variable identifier
- TYPE_FULL_NAME: Variable type
- CONTAINING_METHOD_ID: ID of the containing method
nodes_field_identifier¶
Field access identifiers (OOP - e.g., obj.field).
CREATE TABLE nodes_field_identifier (
id BIGINT NOT NULL,
canonical_name VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- CANONICAL_NAME: Normalized name (e.g., “myField” for both a.myField and b.myField)
- CODE: Field access as written (e.g., “obj.field”)
- Purpose: Identify field accesses in OOP code (critical for alias analysis)
Example:
struct Point { int x, y; };
Point p;
p.x = 10; // <- "x" is FIELD_IDENTIFIER with canonical_name="x"
nodes_literal¶
Constant values.
CREATE TABLE nodes_literal (
id BIGINT NOT NULL,
code VARCHAR NOT NULL,
type_full_name VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
Properties (from CPG spec):
- TYPE_FULL_NAME: Literal type
- CODE: Literal value
nodes_local¶
Local variable declarations.
CREATE TABLE nodes_local (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
type_full_name VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
Properties (from CPG spec):
- NAME: Local variable name
- TYPE_FULL_NAME: Declared type
nodes_param¶
Method parameters (formal parameters).
CREATE TABLE nodes_param (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
type_full_name VARCHAR,
index INTEGER,
is_variadic BOOLEAN DEFAULT FALSE,
evaluation_strategy VARCHAR,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
method_id BIGINT,
parent_param_id BIGINT,
typedef_id BIGINT,
struct_member_id BIGINT
);
Properties (from CPG spec):
- INDEX: Parameter position
- IS_VARIADIC: Variable-length parameter
- EVALUATION_STRATEGY: BY_VALUE, BY_REFERENCE, BY_SHARING
- METHOD_ID: ID of the containing method
- PARENT_PARAM_ID: ID of the parent parameter (for nested/destructured params)
- TYPEDEF_ID: ID of the typedef node (for C typedef resolution)
- STRUCT_MEMBER_ID: ID of the struct member node (for C struct member resolution)
nodes_method_parameter_out¶
Method output parameters (for SSA/data flow analysis). Also used for Go multiple return values.
CREATE TABLE nodes_method_parameter_out (
id BIGINT NOT NULL,
name VARCHAR,
type_full_name VARCHAR,
code VARCHAR,
index INTEGER,
evaluation_strategy VARCHAR,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
method_id BIGINT
);
Properties (from CPG spec):
- Corresponds to METHOD_PARAMETER_IN for data flow
- INDEX: Parameter position (matches input parameter)
- EVALUATION_STRATEGY: BY_VALUE, BY_REFERENCE, BY_SHARING
- METHOD_ID: ID of the containing method
- Required for SSA (Static Single Assignment) analysis
nodes_method_return¶
Method return parameter (formal return).
CREATE TABLE nodes_method_return (
id BIGINT NOT NULL,
type_full_name VARCHAR,
code VARCHAR,
evaluation_strategy VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
method_id BIGINT
);
Properties (from CPG spec):
- TYPE_FULL_NAME: Return type
- CODE: Typically “RET” or empty
- EVALUATION_STRATEGY: How return value is passed
- METHOD_ID: ID of the containing method
- One per method (formal return parameter, not return statement)
nodes_return¶
Return statements (actual return in code).
CREATE TABLE nodes_return (
id BIGINT NOT NULL,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
Note: This is the RETURN statement node. Different from METHOD_RETURN which is the formal return parameter.
nodes_block¶
Compound statements (code blocks).
CREATE TABLE nodes_block (
id BIGINT NOT NULL,
type_full_name VARCHAR,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
nodes_control_structure¶
Control flow constructs (if, while, for, etc.).
CREATE TABLE nodes_control_structure (
id BIGINT NOT NULL,
control_structure_type VARCHAR NOT NULL,
parser_type_name VARCHAR,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
containing_method_id BIGINT
);
Properties (from CPG spec): - CONTROL_STRUCTURE_TYPE: BREAK, CONTINUE, DO, WHILE, FOR, GOTO, IF, ELSE, TRY, THROW, SWITCH
nodes_member¶
Type members (fields of classes/structs).
CREATE TABLE nodes_member (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
type_full_name VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- NAME: Member name (e.g., “x”, “y”)
- TYPE_FULL_NAME: Member type (e.g., “int”, “std::string”)
- AST_PARENT_FULL_NAME: Containing type
- Purpose: Represent fields/members of classes/structs
Example:
struct Point {
int x; // <- MEMBER: name="x", type_full_name="int"
int y; // <- MEMBER: name="y", type_full_name="int"
};
nodes_type_decl¶
Type declarations (classes, structs).
CREATE TABLE nodes_type_decl (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
full_name VARCHAR NOT NULL,
alias_type_full_name VARCHAR,
inherits_from_type_full_name VARCHAR[],
is_external BOOLEAN DEFAULT FALSE,
filename VARCHAR,
line_number INTEGER,
ast_parent_type VARCHAR,
ast_parent_full_name VARCHAR,
code VARCHAR
);
Properties (from CPG spec):
- FULL_NAME, NAME: Type identification
- IS_EXTERNAL: True when the type is referenced but not defined in the analyzed source
- INHERITS_FROM_TYPE_FULL_NAME: Base types (array)
- ALIAS_TYPE_FULL_NAME: Type alias
- AST_PARENT_TYPE, AST_PARENT_FULL_NAME: Parent type context
nodes_metadata¶
CPG metadata (required by spec).
CREATE TABLE nodes_metadata (
id BIGINT NOT NULL,
language VARCHAR NOT NULL,
version VARCHAR DEFAULT '1.1',
root VARCHAR,
overlays VARCHAR[],
hash VARCHAR
);
Properties (from CPG spec):
- LANGUAGE: Source language
- VERSION: CPG spec version (default “1.1”)
- ROOT: Root path
- OVERLAYS: Applied overlays
- HASH: Content hash
nodes_file¶
Source file nodes (required by spec).
CREATE TABLE nodes_file (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
hash VARCHAR,
ast_hash VARCHAR,
content VARCHAR,
size_bytes BIGINT,
language VARCHAR
);
Properties (from CPG spec):
- NAME: File path relative to root (from METADATA.ROOT)
- HASH: SHA-256 or MD5 hash of file content
- AST_HASH: Hash of AST structure (for incremental change detection)
- CONTENT: Optional full source code of the file
- SIZE_BYTES: File size in bytes
- LANGUAGE: Detected programming language
Purpose:
- Index for looking up all code elements by file
- Root nodes of Abstract Syntax Trees (AST)
- Source file metadata storage
- Required for SOURCE_FILE edges
Example:
name="src/main.c", hash="abc123..."
Note: Each source file SHOULD have exactly one FILE node. FILE nodes serve as AST roots and allow navigation from file to all contained code elements.
nodes_namespace_block¶
Namespace block nodes (namespace scopes).
CREATE TABLE nodes_namespace_block (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
full_name VARCHAR NOT NULL,
filename VARCHAR,
order_index INTEGER
);
Properties (from CPG spec):
- NAME: Human-readable namespace name (e.g., “foo.bar”)
  - Dot-separated: “foo.bar” means namespace “bar” inside “foo”
- FULL_NAME: Unique identifier combining file and namespace
  - Should include file info to ensure uniqueness
- FILENAME: Source file containing this namespace block
- ORDER_INDEX: Position in parent AST
Purpose:
- Represent namespace blocks (C++ namespace{}, Java package)
- Structure code into logical units
- Allow namespace-based code queries
- Support multi-file namespace analysis
Examples:
// C++:
namespace foo {
namespace bar {
// code
}
}
// NAME="foo.bar", FULL_NAME="main.cpp:foo.bar"
// Java:
package com.example.myapp;
// NAME="com.example.myapp", FULL_NAME="Main.java:com.example.myapp"
Note: NAMESPACE nodes (indices) are auto-generated from NAMESPACE_BLOCK nodes when CPG is loaded.
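The relationship between NAMESPACE_BLOCK rows and the derived NAMESPACE index can be illustrated by grouping blocks on NAME. This is a sketch of the idea only: the real loader's logic is not shown in this document, sqlite3 stands in for DuckDB, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_namespace_block (
        id BIGINT NOT NULL,
        name VARCHAR NOT NULL,
        full_name VARCHAR NOT NULL,
        filename VARCHAR,
        order_index INTEGER
    );
    INSERT INTO nodes_namespace_block VALUES
        (1, 'foo.bar', 'main.cpp:foo.bar', 'main.cpp', 0),
        (2, 'foo.bar', 'util.cpp:foo.bar', 'util.cpp', 0),
        (3, 'foo',     'main.cpp:foo',     'main.cpp', 0);
    """
)

# One logical namespace per distinct NAME, however many files contribute blocks.
namespaces = conn.execute(
    """
    SELECT name, COUNT(*) AS block_count
    FROM nodes_namespace_block
    GROUP BY name
    ORDER BY name
    """
).fetchall()
print(namespaces)
```

FULL_NAME stays unique per file, while NAME is shared across files, which is why NAME is the grouping key for multi-file namespace analysis.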
nodes_method_ref¶
Method reference nodes (method as value).
CREATE TABLE nodes_method_ref (
id BIGINT NOT NULL,
method_full_name VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- METHOD_FULL_NAME: Fully-qualified name of referenced method
- TYPE_FULL_NAME: Type of the method (e.g., “int(*)(int, int)” in C)
- CODE: How the reference appears in source
- OFFSET/OFFSET_END: Precise source location
Purpose:
- Represent methods passed as arguments (higher-order functions)
- Function pointers (C/C++)
- Lambda expressions / closures
- Method handles (Java)
- Delegate types (C#)
Examples:
// C function pointer:
int (*func_ptr)(int) = &myFunction;
// METHOD_REF: method_full_name="myFunction", type_full_name="int(*)(int)"
// Java method reference:
list.forEach(System.out::println);
// METHOD_REF: method_full_name="System.out.println", type_full_name="Consumer<Object>"
// Python:
callback = some_function
// METHOD_REF: method_full_name="some_function"
Note: METHOD_REF is used when a method is referenced but not called at that location.
nodes_type_ref¶
Type reference nodes (type as value).
CREATE TABLE nodes_type_ref (
id BIGINT NOT NULL,
type_full_name VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- TYPE_FULL_NAME: Fully-qualified name of referenced type
- CODE: How the reference appears in source
- OFFSET/OFFSET_END: Precise source location
Purpose:
- Represent types used as values (not instantiations)
- typeof/typeid operations
- Type casting
- Reflection (Java .class, C# typeof)
- Type arguments to generics
Examples:
// Java reflection:
Class<?> clazz = String.class;
// TYPE_REF: type_full_name="java.lang.String"
// C++ type casting:
auto* ptr = static_cast<MyClass*>(obj);
// TYPE_REF: type_full_name="MyClass"
// Generic type argument:
List<Integer> list = new ArrayList<>();
// TYPE_REF: type_full_name="java.lang.Integer"
Note: TYPE_REF is used when a type is referenced as a value, not when creating an instance.
nodes_unknown¶
Unknown AST nodes (catch-all for unsupported constructs).
CREATE TABLE nodes_unknown (
id BIGINT NOT NULL,
parser_type_name VARCHAR NOT NULL,
code VARCHAR,
type_full_name VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- PARSER_TYPE_NAME: Name of construct as emitted by parser
- TYPE_FULL_NAME: Best-effort type inference
- CODE: Source code representation
Purpose:
- Include AST nodes not specified in CPG spec
- Language-specific constructs
- Experimental/proprietary language features
- Maintain complete AST even for unsupported features
Examples:
# Python walrus operator (if not in spec):
if (n := len(items)) > 10:
...
# UNKNOWN: parser_type_name="NamedExpr", code="n := len(items)"
# Proprietary language extension:
@CustomDirective
# UNKNOWN: parser_type_name="CustomDirective"
Note: UNKNOWN should be used sparingly - prefer proper node types when available.
nodes_jump_target¶
Jump targets (labels for goto/break/continue).
CREATE TABLE nodes_jump_target (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
argument_index INTEGER,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties (from CPG spec):
- NAME: Label name
- PARSER_TYPE_NAME: Type of label construct
- CODE: Label as it appears in source
Purpose:
- Represent jump targets (goto labels, case labels)
- Support control flow analysis with jumps
- Enable goto-based code analysis
- Track switch case targets
Examples:
// C goto label:
error_handler:
cleanup();
return -1;
// JUMP_TARGET: name="error_handler", parser_type_name="Label"
// Switch case label:
switch (x) {
case 42: // JUMP_TARGET: name="case_42"
break;
}
Note: Modern languages discourage goto, but it’s common in C/assembly.
nodes_type_param¶
Type parameters (generics/templates formal parameters).
CREATE TABLE nodes_type_param (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
constraint_type VARCHAR,
"index" INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
method_id BIGINT,
type_decl_id BIGINT
);
Properties (from CPG spec):
- NAME: Type parameter name (e.g., “T”, “K”, “V”)
- CONSTRAINT_TYPE: Upper bound constraint (e.g., “Comparable” for T extends Comparable)
- INDEX: Position among sibling type parameters
- METHOD_ID: Owning method (for method-level generics)
- TYPE_DECL_ID: Owning type declaration (for class-level generics)
Purpose:
- Formal type parameters in generic/template declarations
- Java Generics: class List<T>
- C++ Templates: template<typename T>
- C# Generics: class Dictionary<TKey, TValue>
Examples:
// Java:
class Box<T> { // TYPE_PARAMETER: name="T"
T value;
}
// C++:
template<typename K, typename V>
// TYPE_PARAMETER: name="K"
// TYPE_PARAMETER: name="V"
class Map { ... }
Note: TYPE_PARAMETER is the formal parameter, TYPE_ARGUMENT is the actual type used.
nodes_type_argument¶
Type arguments (generics/templates actual arguments).
CREATE TABLE nodes_type_argument (
id BIGINT NOT NULL,
code VARCHAR,
order_index INTEGER
);
Properties (from CPG spec): - CODE: Type argument code (e.g., “Integer”, “String”)
Purpose:
- Actual type arguments in generic/template instantiations
- Connects to TYPE_PARAMETER via BINDS_TO edge
- Java: List<Integer> - “Integer” is TYPE_ARGUMENT
- C++: vector<int> - “int” is TYPE_ARGUMENT
Examples:
List<Integer> list = new ArrayList<Integer>();
// TYPE_ARGUMENT: code="Integer" (for List)
// TYPE_ARGUMENT: code="Integer" (for ArrayList)
Map<String, Integer> map;
// TYPE_ARGUMENT: code="String"
// TYPE_ARGUMENT: code="Integer"
Note: TYPE_ARGUMENT instances bind to TYPE_PARAMETER declarations.
nodes_binding¶
Name-signature bindings (method resolution).
CREATE TABLE nodes_binding (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
signature VARCHAR,
method_full_name VARCHAR
);
Properties (from CPG spec):
- NAME: Method name
- SIGNATURE: Method signature
- METHOD_FULL_NAME: Fully-qualified resolved method name
Purpose:
- Resolve (name, signature) pairs at type declarations
- Support virtual method dispatch
- Enable polymorphism analysis
- Connect TYPE_DECL to bound methods
Examples:
// Type declaration with methods:
class Animal {
void speak() { } // BINDING: name="speak", signature="void()"
}
class Dog extends Animal {
@Override
void speak() { } // BINDING: name="speak", signature="void()"
}
// BINDING nodes allow resolving which speak() is called
Note: BINDING connects TYPE_DECL to METHOD via BINDS and REF edges.
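The TYPE_DECL -BINDS-> BINDING -REF-> METHOD path from the note can be followed with plain joins. The sketch below is illustrative only: sqlite3 stands in for DuckDB, the tables are trimmed, the src/dst layout of edges_binds is assumed to match the other edge tables, and resolve_method is a hypothetical helper name.

```python
import sqlite3

# Resolve a (name, signature) pair at a given type declaration.
RESOLVE_SQL = """
SELECT m.full_name
FROM nodes_type_decl t
JOIN edges_binds   eb ON eb.src = t.id      -- TYPE_DECL -> BINDING (assumed layout)
JOIN nodes_binding b  ON b.id  = eb.dst
JOIN edges_ref     er ON er.src = b.id      -- BINDING -> METHOD
JOIN nodes_method  m  ON m.id  = er.dst
WHERE t.full_name = ? AND b.name = ? AND b.signature = ?
"""


def resolve_method(conn, type_full_name, name, signature):
    """Return the bound method's full name, or None if nothing is bound."""
    row = conn.execute(RESOLVE_SQL, (type_full_name, name, signature)).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_type_decl (id BIGINT NOT NULL, full_name VARCHAR NOT NULL);
    CREATE TABLE nodes_method (id BIGINT NOT NULL, full_name VARCHAR NOT NULL);
    CREATE TABLE nodes_binding (
        id BIGINT NOT NULL, name VARCHAR NOT NULL,
        signature VARCHAR, method_full_name VARCHAR
    );
    CREATE TABLE edges_binds (src BIGINT NOT NULL, dst BIGINT NOT NULL);
    CREATE TABLE edges_ref (src BIGINT NOT NULL, dst BIGINT NOT NULL);
    INSERT INTO nodes_type_decl VALUES (1, 'Animal'), (2, 'Dog');
    INSERT INTO nodes_method VALUES (10, 'Animal.speak'), (11, 'Dog.speak');
    INSERT INTO nodes_binding VALUES
        (20, 'speak', 'void()', 'Animal.speak'),
        (21, 'speak', 'void()', 'Dog.speak');
    INSERT INTO edges_binds VALUES (1, 20), (2, 21);
    INSERT INTO edges_ref VALUES (20, 10), (21, 11);
    """
)

dog_target = resolve_method(conn, "Dog", "speak", "void()")
print(dog_target)
```

Keying the lookup on the receiver's type declaration is what lets the same (name, signature) pair resolve to different methods for Animal and Dog.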
nodes_closure_binding¶
Closure variable capture (lambda/closure bindings).
CREATE TABLE nodes_closure_binding (
id BIGINT NOT NULL,
closure_binding_id VARCHAR,
closure_original_name VARCHAR,
evaluation_strategy VARCHAR,
filename VARCHAR
);
Properties (from CPG spec):
- CLOSURE_BINDING_ID: Unique identifier for this capture
- EVALUATION_STRATEGY: How variable is captured
- CODE: Captured variable name
Purpose:
- Represent variable capture in closures/lambdas
- Connect captured LOCAL/PARAM to closure
- Support closure analysis
- Enable escape analysis
Examples:
function outer(x) {
let y = 10;
return function inner() {
console.log(x + y); // x and y are captured
};
}
// CLOSURE_BINDING for x: closure_binding_id="outer.inner.x"
// CLOSURE_BINDING for y: closure_binding_id="outer.inner.y"
// Java lambda:
int multiplier = 2;
list.forEach(item -> System.out.println(item * multiplier));
// CLOSURE_BINDING for multiplier
Note: CLOSURE_BINDING connects to LOCAL via CAPTURED_BY and to METHOD_REF via CAPTURE.
nodes_comment¶
Source code comments.
CREATE TABLE nodes_comment (
id BIGINT NOT NULL,
code VARCHAR NOT NULL,
filename VARCHAR,
hash VARCHAR,
line_number INTEGER,
line_number_end INTEGER,
column_number INTEGER,
column_number_end INTEGER,
"offset" INTEGER,
"offset_end" INTEGER,
order_index INTEGER,
containing_method_id BIGINT,
comment_type VARCHAR,
documented_node_id BIGINT,
binding_type VARCHAR
);
Properties (from CPG spec):
- CODE: Comment text (including delimiters)
- FILENAME: Source file containing comment
- OFFSET/OFFSET_END: Precise location
Purpose:
- Preserve source code comments
- Documentation extraction
- Code annotation analysis
- Comment-based security markers
Examples:
// Single-line comment
// COMMENT: code="// Single-line comment"
/* Multi-line
comment */
// COMMENT: code="/* Multi-line\n comment */"
/** JavaDoc comment
* @param x Parameter description
*/
// COMMENT: code="/** JavaDoc...*/"
Note: Comments are AST nodes connected to FILE via AST edges. COMMENT_TYPE classifies comments (TODO, FIXME, HACK, XXX, DOCSTRING). DOCUMENTED_NODE_ID links to the node this comment documents.
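COMMENT_TYPE and DOCUMENTED_NODE_ID together make it possible to list each method with its docstring, or to spot methods without one. The following is a hedged sketch of such a lookup, not the definition of any view in this schema. It uses sqlite3 as a stand-in for DuckDB, trimmed tables, and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_method (id BIGINT NOT NULL, full_name VARCHAR NOT NULL);
    CREATE TABLE nodes_comment (
        id BIGINT NOT NULL,
        code VARCHAR NOT NULL,
        comment_type VARCHAR,
        documented_node_id BIGINT
    );
    INSERT INTO nodes_method VALUES (1, 'math.add'), (2, 'math.sub');
    INSERT INTO nodes_comment VALUES
        (100, '/** Adds two ints. */',    'DOCSTRING', 1),
        (101, '// TODO: handle overflow', 'TODO',      NULL);
    """
)

# LEFT JOIN keeps methods that have no docstring (code comes back as NULL).
method_docs = conn.execute(
    """
    SELECT m.full_name, c.code
    FROM nodes_method m
    LEFT JOIN nodes_comment c
           ON c.documented_node_id = m.id
          AND c.comment_type = 'DOCSTRING'
    ORDER BY m.full_name
    """
).fetchall()
print(method_docs)
```

Putting the comment_type filter in the join condition rather than in WHERE is what preserves undocumented methods in the result.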
nodes_modifier¶
Access modifiers (CPG schema compatibility).
CREATE TABLE nodes_modifier (
id BIGINT NOT NULL,
modifier_type VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER
);
Properties: - MODIFIER_TYPE: Modifier kind (STATIC, PUBLIC, PROTECTED, PRIVATE, ABSTRACT, NATIVE, CONSTRUCTOR, VIRTUAL, INTERNAL, FINAL, READONLY, MODULE)
nodes_annotation¶
Method/class annotations and decorators.
CREATE TABLE nodes_annotation (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
full_name VARCHAR,
code VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties:
- NAME: Annotation name (e.g., “Override”, “Deprecated”)
- FULL_NAME: Fully-qualified annotation name
nodes_type¶
Type instances (CPG schema compatibility).
CREATE TABLE nodes_type (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
full_name VARCHAR NOT NULL,
type_decl_full_name VARCHAR
);
Properties:
- NAME, FULL_NAME: Type identification
- TYPE_DECL_FULL_NAME: Link to the corresponding TYPE_DECL
nodes_jump_label¶
Jump labels (for goto statements).
CREATE TABLE nodes_jump_label (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
nodes_tag_v2¶
External context tags (enrichment system).
CREATE TABLE nodes_tag_v2 (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
value VARCHAR,
external_source VARCHAR,
external_id VARCHAR,
external_url VARCHAR,
confidence FLOAT,
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
Properties:
- NAME: Tag category
- VALUE: Tag value
- EXTERNAL_SOURCE: Source system (e.g., “domain_config”, “vcs”, “enrichment”)
- CONFIDENCE: Tag confidence score (0.0-1.0)
nodes_finding¶
Static analysis findings (security, quality).
CREATE TABLE nodes_finding (
id BIGINT NOT NULL,
title VARCHAR NOT NULL,
description VARCHAR,
severity VARCHAR,
category VARCHAR,
confidence FLOAT DEFAULT 1.0,
source VARCHAR,
rule_id VARCHAR,
status VARCHAR DEFAULT 'open',
metadata JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
Properties:
- TITLE: Finding summary
- SEVERITY: error, warning, info, hint
- CATEGORY: Finding category (security, quality, performance)
- SOURCE: Generator (e.g., “pattern_match”, “finding_generation”)
- RULE_ID: Pattern rule ID (if generated by pattern scan)
- STATUS: open, resolved, suppressed
nodes_macro¶
C preprocessor macros.
CREATE TABLE nodes_macro (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
body VARCHAR,
is_function_like BOOLEAN DEFAULT FALSE,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties:
- NAME: Macro name
- BODY: Macro expansion body
- IS_FUNCTION_LIKE: Whether the macro has parameters (e.g., #define MAX(a,b))
nodes_macro_param¶
Parameters of function-like macros.
CREATE TABLE nodes_macro_param (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
index_ INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR,
macro_id BIGINT
);
Properties:
- NAME: Parameter name
- INDEX_: Position in the macro parameter list
- MACRO_ID: ID of the containing macro
nodes_namespace¶
Namespace index nodes (distinct from NAMESPACE_BLOCK).
CREATE TABLE nodes_namespace (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
code VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Note: NAMESPACE is the index node, NAMESPACE_BLOCK is the scope. NAMESPACE nodes are auto-generated from NAMESPACE_BLOCK nodes.
nodes_annotation_literal¶
Literal values in annotations.
CREATE TABLE nodes_annotation_literal (
id BIGINT NOT NULL,
name VARCHAR,
code VARCHAR,
order_index INTEGER,
argument_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
nodes_annotation_parameter¶
Formal annotation parameters.
CREATE TABLE nodes_annotation_parameter (
id BIGINT NOT NULL,
code VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
nodes_annotation_parameter_assign¶
Annotation argument-to-parameter mapping.
CREATE TABLE nodes_annotation_parameter_assign (
id BIGINT NOT NULL,
code VARCHAR,
order_index INTEGER,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
nodes_key_value_pair¶
Key-value pairs for findings.
CREATE TABLE nodes_key_value_pair (
id BIGINT NOT NULL,
key VARCHAR NOT NULL,
value VARCHAR
);
nodes_location¶
Source code location summary.
CREATE TABLE nodes_location (
id BIGINT NOT NULL,
class_name VARCHAR,
class_short_name VARCHAR,
method_short_name VARCHAR,
node_label VARCHAR,
package_name VARCHAR,
symbol VARCHAR,
filename VARCHAR,
line_number INTEGER,
method_full_name VARCHAR
);
nodes_collection_decl¶
Named collection declarations (dict, list, set, tuple, enum).
CREATE TABLE nodes_collection_decl (
id BIGINT NOT NULL,
parent_id BIGINT,
class_id BIGINT,
name VARCHAR,
full_name VARCHAR,
collection_type VARCHAR,
element_count INTEGER,
filename VARCHAR,
line_number INTEGER,
keys_json VARCHAR,
value_type_hint VARCHAR
);
Properties:
- NAME: Variable name (e.g. CWE_DATABASE)
- FULL_NAME: FQN (e.g. src/security/kb.py:CWE_DATABASE)
- COLLECTION_TYPE: dict, list, set, tuple, enum
- ELEMENT_COUNT: Number of top-level elements
- KEYS_JSON: JSON array of dict keys (max 256, dict only)
- VALUE_TYPE_HINT: Inferred value type if homogeneous
Currently supported: Python frontend only. Support for JS/TS, Go, and Java is planned.
nodes_config_file¶
Configuration file content.
CREATE TABLE nodes_config_file (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
content VARCHAR
);
nodes_import¶
Import/include statements.
CREATE TABLE nodes_import (
id BIGINT NOT NULL,
imported_entity VARCHAR NOT NULL,
imported_as VARCHAR,
is_wildcard BOOLEAN DEFAULT FALSE,
is_explicit BOOLEAN DEFAULT TRUE,
code VARCHAR,
line_number INTEGER,
column_number INTEGER,
filename VARCHAR
);
Properties:
- IMPORTED_ENTITY: What is imported (e.g., “fmt”, “os.path”, “stdio.h”)
- IMPORTED_AS: Alias (e.g., import f "fmt" -> imported_as="f")
- IS_WILDCARD: Wildcard import (e.g., import . "pkg", from os import *)
- IS_EXPLICIT: Explicit import (vs. implicit)
Examples:
Go: import "fmt" -> imported_entity="fmt"
Go: import f "fmt" -> imported_entity="fmt", imported_as="f"
C: #include <stdio.h> -> imported_entity="stdio.h"
Python: from os import path -> imported_entity="os.path"
JS: import { foo } from 'bar' -> imported_entity="bar"
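The IS_WILDCARD flag lends itself to a quick per-file audit. The sketch below counts wildcard imports by file. It assumes sqlite3 as a stand-in for DuckDB (so booleans are written as 1/0), keeps only the columns the query needs, and uses made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes_import (
        id BIGINT NOT NULL,
        imported_entity VARCHAR NOT NULL,
        is_wildcard BOOLEAN DEFAULT 0,
        filename VARCHAR
    );
    INSERT INTO nodes_import VALUES
        (1, 'os',  1, 'a.py'),
        (2, 'fmt', 0, 'main.go'),
        (3, 'sys', 1, 'a.py'),
        (4, 'pkg', 1, 'main.go');
    """
)

# Files ranked by how many wildcard imports they contain.
wildcard_counts = conn.execute(
    """
    SELECT filename, COUNT(*) AS wildcard_imports
    FROM nodes_import
    WHERE is_wildcard
    GROUP BY filename
    ORDER BY wildcard_imports DESC, filename
    """
).fetchall()
print(wildcard_counts)
```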
Edge Tables¶
edges_ast¶
Abstract Syntax Tree edges (parent-child relationships).
CREATE TABLE edges_ast (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_ast_src ON edges_ast(src);
CREATE INDEX idx_ast_dst ON edges_ast(dst);
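Because edges_ast stores only direct parent-child pairs, collecting a whole subtree takes a recursive query. The sketch below shows one way to do this with a recursive CTE. It runs on sqlite3 as a stand-in for DuckDB (DuckDB also supports WITH RECURSIVE, though parameter typing may differ), and the node IDs are invented.

```python
import sqlite3

# Start at a root node and repeatedly follow src -> dst AST edges.
SUBTREE_SQL = """
WITH RECURSIVE subtree(id, depth) AS (
    SELECT ?, 0
    UNION ALL
    SELECT e.dst, s.depth + 1
    FROM edges_ast e
    JOIN subtree s ON e.src = s.id
)
SELECT id, depth FROM subtree ORDER BY depth, id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE edges_ast (src BIGINT NOT NULL, dst BIGINT NOT NULL);
    CREATE INDEX idx_ast_src ON edges_ast(src);
    CREATE INDEX idx_ast_dst ON edges_ast(dst);
    -- 1 is the root; 2 and 3 are its children; 4 is a child of 3
    INSERT INTO edges_ast VALUES (1, 2), (1, 3), (3, 4);
    """
)

subtree = conn.execute(SUBTREE_SQL, (1,)).fetchall()
print(subtree)
```

UNION ALL is safe here because an AST is a tree; for edge types that may contain cycles, a depth bound or UNION would be needed to guarantee termination.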
edges_cfg¶
Control Flow Graph edges.
CREATE TABLE edges_cfg (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_cfg_src ON edges_cfg(src);
CREATE INDEX idx_cfg_dst ON edges_cfg(dst);
edges_call¶
Call site to method declaration edges.
CREATE TABLE edges_call (
src BIGINT NOT NULL,
dst BIGINT NOT NULL,
cross_language BOOLEAN DEFAULT FALSE,
binding_type VARCHAR DEFAULT ''
);
CREATE INDEX idx_call_edge_src ON edges_call(src);
CREATE INDEX idx_call_edge_dst ON edges_call(dst);
Properties:
- CROSS_LANGUAGE: Whether this is a cross-language call (e.g., CGO Go->C, ctypes Python->C)
- BINDING_TYPE: How the call was resolved (e.g., “exact”, “name_only”, “import_path”)
edges_ref¶
Reference edges (identifier to declaration).
CREATE TABLE edges_ref (
src BIGINT NOT NULL, -- IDENTIFIER/CALL node id
dst BIGINT NOT NULL -- DECLARATION node id (LOCAL, PARAM, METHOD, TYPE_DECL)
);
CREATE INDEX idx_ref_src ON edges_ref(src);
CREATE INDEX idx_ref_dst ON edges_ref(dst);
edges_reaching_def¶
Data flow edges (reaching definitions).
CREATE TABLE edges_reaching_def (
src BIGINT NOT NULL,
dst BIGINT NOT NULL,
variable VARCHAR -- Variable name
);
CREATE INDEX idx_reaching_def_src ON edges_reaching_def(src);
CREATE INDEX idx_reaching_def_dst ON edges_reaching_def(dst);
CREATE INDEX idx_reaching_def_variable ON edges_reaching_def(variable);
Properties (from CPG spec):
- VARIABLE: Variable name being tracked
edges_argument¶
Argument edges (call to argument expressions, return to returned expression).
CREATE TABLE edges_argument (
src BIGINT NOT NULL,
dst BIGINT NOT NULL,
argument_index INTEGER,
argument_name VARCHAR
);
CREATE INDEX idx_argument_src ON edges_argument(src);
CREATE INDEX idx_argument_dst ON edges_argument(dst);
Properties:
- ARGUMENT_INDEX: Position of argument in argument list
- ARGUMENT_NAME: Named argument name (for languages with keyword arguments)
edges_receiver¶
Receiver edges (call to receiver object).
CREATE TABLE edges_receiver (
src BIGINT NOT NULL, -- CALL node id
dst BIGINT NOT NULL -- Receiver expression id
);
edges_condition¶
Condition edges (control structure to conditional expression).
CREATE TABLE edges_condition (
src BIGINT NOT NULL, -- CONTROL_STRUCTURE node id
dst BIGINT NOT NULL -- Expression node id
);
edges_dominate¶
Immediate dominator edges (control flow domination).
CREATE TABLE edges_dominate (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_dominate_src ON edges_dominate(src);
CREATE INDEX idx_dominate_dst ON edges_dominate(dst);
edges_post_dominate¶
Post-dominator edges.
CREATE TABLE edges_post_dominate (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_post_dominate_src ON edges_post_dominate(src);
CREATE INDEX idx_post_dominate_dst ON edges_post_dominate(dst);
edges_cdg¶
Control Dependence Graph edges (CRITICAL for PDG).
CREATE TABLE edges_cdg (
src BIGINT NOT NULL, -- Control structure node id (condition/branch)
dst BIGINT NOT NULL -- Dependent node id (code that depends on condition)
);
Properties (from CPG spec):
- CDG edge means: dst is control-dependent on src
- Essential for Program Dependence Graph (PDG = DDG + CDG)
- Used for program slicing, security analysis, compiler optimizations
- Example: statements inside IF block are control-dependent on IF condition
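The relation PDG = DDG + CDG and its use for slicing can be sketched as a union of the two edge tables followed by a backward traversal. This is only an illustration: SQLite from Python's standard library stands in for DuckDB so the sketch runs without extra packages, the node ids are invented, and how `edges_pdg` is populated in practice is not described here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges_cdg (src BIGINT NOT NULL, dst BIGINT NOT NULL);
CREATE TABLE edges_ddg (src BIGINT NOT NULL, dst BIGINT NOT NULL);
-- Invented ids: 10 = IF condition, 11 and 12 = statements inside the IF block.
INSERT INTO edges_cdg VALUES (10, 11), (10, 12);
INSERT INTO edges_ddg VALUES (5, 10), (11, 12);
""")

# PDG as the union of control-dependence and data-dependence edges.
pdg = con.execute("""
    SELECT src, dst, 'CDG' AS kind FROM edges_cdg
    UNION ALL
    SELECT src, dst, 'DDG' AS kind FROM edges_ddg
    ORDER BY src, dst, kind
""").fetchall()

# Backward slice: every node that node 12 depends on, directly or transitively.
slice_ids = [row[0] for row in con.execute("""
    WITH RECURSIVE
    pdg(src, dst) AS (
        SELECT src, dst FROM edges_cdg
        UNION
        SELECT src, dst FROM edges_ddg
    ),
    deps(id) AS (
        SELECT src FROM pdg WHERE dst = 12
        UNION
        SELECT pdg.src FROM pdg JOIN deps ON pdg.dst = deps.id
    )
    SELECT id FROM deps ORDER BY id
""")]
```

The slice for node 12 contains the IF condition (10), the statement it reads from (11), and the definition feeding the condition (5).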
edges_binds¶
Binding edges (name bindings).
CREATE TABLE edges_binds (
src BIGINT NOT NULL, -- BINDING node id
dst BIGINT NOT NULL -- METHOD or TYPE_DECL node id
);
Properties (from CPG spec):
- Connects BINDING nodes to their declarations
- Used for variable/function name resolution
- Example: import statement binds name to actual definition
edges_binds_to¶
Reverse binding edges (name uses).
CREATE TABLE edges_binds_to (
src BIGINT NOT NULL, -- Variable/function reference node id
dst BIGINT NOT NULL -- BINDING node id
);
Properties (from CPG spec):
- Reverse of BINDS edge
- Connects uses of names to their bindings
- Example: variable reference → binding → declaration
BINDS workflow:
Declaration (METHOD/TYPE_DECL)
↑
BINDS
|
BINDING node (import/using statement)
↑
BINDS_TO
|
Reference (IDENTIFIER/CALL)
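The workflow above amounts to a two-hop join: follow BINDS_TO from the reference to its BINDING node, then BINDS from the binding to the declaration. A minimal sketch, with SQLite from Python's standard library standing in for DuckDB and invented node ids:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edges_binds (src BIGINT NOT NULL, dst BIGINT NOT NULL);
CREATE TABLE edges_binds_to (src BIGINT NOT NULL, dst BIGINT NOT NULL);
-- Invented ids: 300 = CALL reference, 200 = BINDING, 100 = METHOD declaration.
INSERT INTO edges_binds_to VALUES (300, 200);
INSERT INTO edges_binds VALUES (200, 100);
""")

# Reference --BINDS_TO--> BINDING --BINDS--> Declaration
resolved = con.execute("""
    SELECT bt.src AS reference_id, b.src AS binding_id, b.dst AS declaration_id
    FROM edges_binds_to bt
    JOIN edges_binds b ON b.src = bt.dst
    WHERE bt.src = ?
""", (300,)).fetchall()
```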
edges_source_file¶
Source file edges (node to file mapping).
CREATE TABLE edges_source_file (
src BIGINT NOT NULL, -- Any AST node id
dst BIGINT NOT NULL -- FILE node id
);
CREATE INDEX idx_source_file_src ON edges_source_file(src);
CREATE INDEX idx_source_file_dst ON edges_source_file(dst);
Properties (from CPG spec):
- Connects nodes to their source FILE
- Auto-created based on FILENAME properties
- MUST NOT be created by the language frontend; created automatically
- One-to-one relationship: each node has exactly one source file
Purpose:
- Map any code element back to its source file
- Navigate from FILE to all contained elements
- Support file-based queries and analysis
- Enable IDE “go to file” functionality
Example:
METHOD node (id=100) → SOURCE_FILE → FILE node (id=1, name="main.c")
CALL node (id=200) → SOURCE_FILE → FILE node (id=1, name="main.c")
TYPE_DECL node (id=300) → SOURCE_FILE → FILE node (id=2, name="types.h")
Auto-creation logic:
1. Frontend sets FILENAME property on nodes (METHOD, TYPE_DECL, etc.)
2. CPG loader creates FILE nodes for unique filenames
3. CPG loader creates SOURCE_FILE edges from nodes to FILE nodes
4. Results in complete file→code mapping
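The auto-creation steps can be sketched in a few lines of Python. The function name `link_source_files` and the id allocation through an iterator are assumptions for illustration; the actual loader may assign ids differently.

```python
from itertools import count

def link_source_files(nodes, next_id):
    """nodes: iterable of (node_id, filename); next_id: iterator of fresh FILE ids."""
    file_ids = {}            # filename -> FILE node id (step 2)
    source_file_edges = []   # (src node id, dst FILE node id) pairs (step 3)
    for node_id, filename in nodes:
        if not filename:
            continue  # nodes without a FILENAME get no SOURCE_FILE edge
        if filename not in file_ids:
            file_ids[filename] = next(next_id)
        source_file_edges.append((node_id, file_ids[filename]))
    return file_ids, source_file_edges

# Same nodes as in the example above: two in main.c, one in types.h.
file_ids, edges = link_source_files(
    [(100, "main.c"), (200, "main.c"), (300, "types.h")],
    count(1),
)
```

Each node ends up with exactly one edge, which is what the one-to-one property requires.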
edges_alias_of¶
Type alias edges.
CREATE TABLE edges_alias_of (
src BIGINT, -- TYPE_DECL node (alias)
dst BIGINT -- TYPE node (actual type)
);
Properties (from CPG spec):
- Connects TYPE_DECL (alias) to TYPE (actual)
- MUST NOT be created by frontend; auto-created from ALIAS_TYPE_FULL_NAME
- One-to-one relationship
Purpose:
- Represent type aliases (C typedef, using, type aliases)
- Enable alias resolution
- Support type synonym analysis
Examples:
// C typedef:
typedef int Integer;
// TYPE_DECL "Integer" --ALIAS_OF--> TYPE "int"
// C++ using:
using String = std::string;
// TYPE_DECL "String" --ALIAS_OF--> TYPE "std::string"
// Rust type alias:
type Result<T> = std::result::Result<T, Error>;
// TYPE_DECL "Result" --ALIAS_OF--> TYPE "std::result::Result"
Note: Auto-generated when CPG is loaded based on ALIAS_TYPE_FULL_NAME property.
edges_inherits_from¶
Type inheritance edges.
CREATE TABLE edges_inherits_from (
src BIGINT, -- TYPE_DECL node (derived)
dst BIGINT -- TYPE node (base)
);
CREATE INDEX idx_inherits_from_src ON edges_inherits_from(src);
CREATE INDEX idx_inherits_from_dst ON edges_inherits_from(dst);
Properties (from CPG spec):
- Connects TYPE_DECL (derived) to TYPE (base)
- MUST NOT be created by frontend; auto-created from INHERITS_FROM_TYPE_FULL_NAME
- One-to-many relationship (multiple inheritance supported)
Purpose:
- Represent class/interface inheritance
- Enable polymorphism analysis
- Support type hierarchy queries
- Track inheritance chains
Examples:
// Java: one superclass plus one interface:
class Dog extends Animal implements Comparable {
...
}
// TYPE_DECL "Dog" --INHERITS_FROM--> TYPE "Animal"
// TYPE_DECL "Dog" --INHERITS_FROM--> TYPE "Comparable"
// C++ multiple inheritance:
class D : public A, public B { };
// TYPE_DECL "D" --INHERITS_FROM--> TYPE "A"
// TYPE_DECL "D" --INHERITS_FROM--> TYPE "B"
Note: Auto-generated when CPG is loaded based on INHERITS_FROM_TYPE_FULL_NAME array.
edges_capture¶
Closure capture edges.
CREATE TABLE edges_capture (
src BIGINT NOT NULL,
dst BIGINT NOT NULL,
variable_name VARCHAR
);
Properties:
- VARIABLE_NAME: Name of the captured variable
Properties (from CPG spec):
- Connects METHOD_REF/TYPE_REF to CLOSURE_BINDING
- Represents variable capture in closure/lambda
- One-to-many relationship (closure can capture multiple variables)
Purpose:
- Track which variables are captured by closures
- Enable escape analysis
- Support closure optimization
- Identify captured variable lifetimes
Examples:
function outer() {
let x = 10;
let y = 20;
return function inner() {
return x + y; // captures x and y
};
}
// METHOD_REF "inner" --CAPTURE--> CLOSURE_BINDING for x
// METHOD_REF "inner" --CAPTURE--> CLOSURE_BINDING for y
Note: CAPTURE edge connects closure to its captured variables.
edges_contains¶
Structural containment edges (method to node).
CREATE TABLE edges_contains (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_contains_src ON edges_contains(src);
CREATE INDEX idx_contains_dst ON edges_contains(dst);
Properties:
- Connects methods/files to their contained nodes
- Used for structural queries (e.g., “find all calls in method X”)
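A "find all calls in method X" query over CONTAINS edges might look like the sketch below. SQLite from Python's standard library stands in for DuckDB, the node tables are cut down to the few columns the query needs, and the rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Reduced column sets; the real node tables carry many more columns.
CREATE TABLE nodes_method (id BIGINT NOT NULL, full_name VARCHAR);
CREATE TABLE nodes_call (id BIGINT NOT NULL, line_number INTEGER);
CREATE TABLE edges_contains (src BIGINT NOT NULL, dst BIGINT NOT NULL);
INSERT INTO nodes_method VALUES (1, 'main'), (2, 'helper');
INSERT INTO nodes_call VALUES (10, 3), (11, 7), (12, 4);
INSERT INTO edges_contains VALUES (1, 10), (1, 11), (2, 12);
""")

# Joining against nodes_call keeps only contained nodes that are call sites.
calls_in_main = con.execute("""
    SELECT c.id, c.line_number
    FROM nodes_method m
    JOIN edges_contains ec ON ec.src = m.id
    JOIN nodes_call c ON c.id = ec.dst
    WHERE m.full_name = ?
    ORDER BY c.line_number
""", ("main",)).fetchall()
```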
edges_eval_type¶
Type evaluation edges (node to type).
CREATE TABLE edges_eval_type (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_eval_type_src ON edges_eval_type(src);
CREATE INDEX idx_eval_type_dst ON edges_eval_type(dst);
edges_tagged_by¶
Tag association edges (node to tag).
CREATE TABLE edges_tagged_by (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
CREATE INDEX idx_tagged_by_src ON edges_tagged_by(src);
CREATE INDEX idx_tagged_by_dst ON edges_tagged_by(dst);
edges_parameter_link¶
Parameter link edges (input to output param).
CREATE TABLE edges_parameter_link (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
edges_pdg¶
Program Dependence Graph edges (CDG + data dependencies).
CREATE TABLE edges_pdg (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
edges_ddg¶
Data Dependency Graph edges (CodeGraph compatibility).
CREATE TABLE edges_ddg (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
edges_vtable¶
Virtual table edges (virtual method dispatch).
CREATE TABLE edges_vtable (
src BIGINT NOT NULL,
dst BIGINT NOT NULL
);
edges_documents¶
Comment to documented node edges.
CREATE TABLE edges_documents (
src BIGINT NOT NULL,
dst BIGINT NOT NULL,
binding_type VARCHAR
);
Properties:
- BINDING_TYPE: How the comment was linked to the node (e.g., “proximity”, “docstring”)
- Connects nodes_comment to the documented node (method, type, etc.)
Views¶
call_containment¶
Denormalized caller/callee view used by Python CodeGraph for architecture, retrieval, and workflow queries.
CREATE VIEW call_containment AS
SELECT DISTINCT
caller.name AS containing_method_name,
caller.full_name AS containing_method_full_name,
callee.name AS callee_name,
callee.full_name AS callee_full_name,
caller.filename AS caller_filename,
callee.filename AS callee_filename,
nc.line_number AS call_line_number,
ec.cross_language AS cross_language,
ec.binding_type AS binding_type,
caller_file.language AS caller_language,
callee_file.language AS callee_language
FROM edges_call ec
JOIN nodes_call nc ON ec.src = nc.id
JOIN nodes_method caller ON nc.containing_method_id = caller.id
JOIN nodes_method callee ON ec.dst = callee.id
LEFT JOIN nodes_file caller_file ON caller.filename = caller_file.name
LEFT JOIN nodes_file callee_file ON callee.filename = callee_file.name;
Columns (11):
- containing_method_name, containing_method_full_name: Caller method
- callee_name, callee_full_name: Callee method
- caller_filename, callee_filename: Source files
- call_line_number: Line of the call site
- cross_language: Whether call crosses language boundary
- binding_type: How the call was resolved
- caller_language, callee_language: Languages of caller and callee
method_docstrings¶
Maps comments to documented methods (used for function description extraction).
CREATE VIEW method_docstrings AS
SELECT
m.id AS method_id,
m.name AS method_name,
m.full_name,
m.filename,
m.line_number AS method_line,
c.code AS docstring,
c.line_number AS comment_line
FROM edges_documents ed
JOIN nodes_comment c ON ed.src = c.id
JOIN nodes_method m ON ed.dst = m.id;
State Tables¶
Tables for incremental updates, FQN index, and git tracking.
cpg_fqn_index¶
Fast symbol resolution index (inspired by Yandex SourceCraft).
CREATE TABLE cpg_fqn_index (
fqn VARCHAR NOT NULL,
node_id BIGINT NOT NULL,
node_type VARCHAR NOT NULL,
name VARCHAR NOT NULL,
signature VARCHAR,
filename VARCHAR NOT NULL,
visibility VARCHAR DEFAULT 'public',
param_count INTEGER DEFAULT 0,
return_type VARCHAR,
PRIMARY KEY (fqn, node_type)
);
cpg_file_state¶
Track file state for incremental updates.
CREATE TABLE cpg_file_state (
filename VARCHAR PRIMARY KEY,
hash VARCHAR NOT NULL,
ast_hash VARCHAR,
mtime BIGINT,
parsed_at BIGINT DEFAULT (epoch_ms(now())),
branch_name VARCHAR
);
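One way this table could drive an incremental update is to compare the stored hash of each file with a hash of its current content. The sketch below is an assumption-laden illustration: the schema only declares `hash VARCHAR`, so the choice of SHA-256 is a guess, and the helper names `content_hash` and `files_to_reparse` are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # SHA-256 is an assumption; the schema does not name the hash algorithm.
    return hashlib.sha256(data).hexdigest()

def files_to_reparse(current: dict[str, bytes], stored: dict[str, str]):
    """current: filename -> file bytes; stored: filename -> hash from cpg_file_state."""
    # New files have no stored hash, so they count as changed.
    changed = sorted(f for f, data in current.items()
                     if stored.get(f) != content_hash(data))
    # Files present in the state table but missing on disk were deleted.
    deleted = sorted(f for f in stored if f not in current)
    return changed, deleted

stored = {"a.py": content_hash(b"x = 1\n"), "b.py": content_hash(b"y = 2\n")}
current = {"a.py": b"x = 1\n", "c.py": b"z = 3\n"}
changed, deleted = files_to_reparse(current, stored)
```

In this example `a.py` is unchanged and skipped, `c.py` is new and must be parsed, and `b.py` has been removed.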
cpg_git_state¶
Track last known git commit.
CREATE TABLE cpg_git_state (
id INTEGER PRIMARY KEY DEFAULT 1,
commit_hash VARCHAR NOT NULL,
branch VARCHAR,
updated_at BIGINT DEFAULT (epoch_ms(now()))
);
cpg_branch_state¶
Track CPG state per branch for branch-aware updates.
CREATE TABLE cpg_branch_state (
branch_name VARCHAR PRIMARY KEY,
last_commit_sha VARCHAR(40) NOT NULL,
last_commit_timestamp BIGINT,
file_count INTEGER DEFAULT 0,
node_count INTEGER DEFAULT 0,
is_active BOOLEAN DEFAULT FALSE,
created_at BIGINT DEFAULT (epoch_ms(now())),
updated_at BIGINT DEFAULT (epoch_ms(now()))
);
cpg_submodule_state¶
Track git submodule state for submodule-aware updates.
CREATE TABLE cpg_submodule_state (
submodule_path VARCHAR(4096) PRIMARY KEY,
submodule_url VARCHAR(4096),
current_commit_sha VARCHAR(40),
parent_commit_sha VARCHAR(40),
is_initialized BOOLEAN DEFAULT FALSE,
is_recursive BOOLEAN DEFAULT FALSE,
last_updated BIGINT DEFAULT (epoch_ms(now()))
);
export_progress¶
Export progress for resumable imports.
CREATE TABLE export_progress (
entity_type VARCHAR PRIMARY KEY,
total_count BIGINT,
exported_count BIGINT,
last_offset BIGINT,
status VARCHAR,
last_updated BIGINT,
error_message VARCHAR
);
Domain and Pattern Tables¶
cpg_domain_annotations¶
Structured function-level domain metadata.
CREATE TABLE cpg_domain_annotations (
id BIGINT NOT NULL,
method_id BIGINT,
function_name VARCHAR NOT NULL,
annotation_type VARCHAR NOT NULL,
category VARCHAR,
value VARCHAR,
confidence FLOAT DEFAULT 1.0,
source VARCHAR DEFAULT 'domain_config',
metadata JSON
);
cpg_domain_subsystems¶
Project organizational structure.
CREATE TABLE cpg_domain_subsystems (
id BIGINT NOT NULL,
name VARCHAR NOT NULL,
display_name VARCHAR,
description VARCHAR,
patterns VARCHAR[],
key_files VARCHAR[],
parent_id BIGINT
);
cpg_domain_subsystem_members¶
Method to subsystem membership.
CREATE TABLE cpg_domain_subsystem_members (
method_id BIGINT NOT NULL,
subsystem_id BIGINT NOT NULL,
confidence FLOAT DEFAULT 1.0,
source VARCHAR DEFAULT 'domain_config',
PRIMARY KEY (method_id, subsystem_id)
);
finding_evidence¶
Links findings to CPG nodes with role-based evidence chain.
CREATE TABLE finding_evidence (
id BIGINT NOT NULL,
finding_id BIGINT NOT NULL,
node_id BIGINT,
role VARCHAR,
description VARCHAR,
filename VARCHAR,
line_number INTEGER,
code VARCHAR
);
cpg_enrichment_state¶
Tracks what enrichments have been computed.
CREATE TABLE cpg_enrichment_state (
enrichment_type VARCHAR NOT NULL,
source VARCHAR NOT NULL,
scope VARCHAR DEFAULT 'global',
last_computed TIMESTAMP,
version VARCHAR,
item_count INTEGER,
needs_refresh BOOLEAN DEFAULT FALSE,
PRIMARY KEY (enrichment_type, source, scope)
);
cpg_enrichment_anchors¶
FQN-based stable identity for enrichment tags across updates.
CREATE TABLE cpg_enrichment_anchors (
tag_id BIGINT NOT NULL,
anchor_fqn VARCHAR NOT NULL,
anchor_hash VARCHAR,
filename VARCHAR,
source VARCHAR,
PRIMARY KEY (tag_id)
);
cpg_pattern_results¶
Pattern matching results (CPG-aware structural search).
CREATE TABLE cpg_pattern_results (
id BIGINT PRIMARY KEY,
rule_id VARCHAR NOT NULL,
node_id BIGINT NOT NULL,
filename VARCHAR NOT NULL,
line_number INTEGER NOT NULL,
column_number INTEGER DEFAULT 0,
code VARCHAR,
message VARCHAR,
severity VARCHAR NOT NULL,
category VARCHAR,
confidence DOUBLE DEFAULT 1.0,
match_data VARCHAR,
cpg_context VARCHAR,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
cpg_pattern_rules¶
Pattern matching rules loaded for scan.
CREATE TABLE cpg_pattern_rules (
rule_id VARCHAR PRIMARY KEY,
language VARCHAR NOT NULL,
severity VARCHAR NOT NULL,
category VARCHAR,
has_cpg BOOLEAN DEFAULT FALSE,
rule_source VARCHAR,
loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
cpg_pattern_deps¶
Pattern dependency tracking for incremental invalidation.
CREATE TABLE cpg_pattern_deps (
rule_id VARCHAR NOT NULL,
filename VARCHAR NOT NULL,
dep_filename VARCHAR NOT NULL,
dep_type VARCHAR NOT NULL,
PRIMARY KEY (rule_id, filename, dep_filename, dep_type)
);
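Incremental invalidation with this table could work roughly as follows: when a file changes, every (rule, file) result that was computed on that file or depends on it is treated as stale. SQLite from Python's standard library stands in for DuckDB; the rule ids, paths, and `dep_type` values are invented, and the invalidation policy itself is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cpg_pattern_deps (
    rule_id VARCHAR NOT NULL,
    filename VARCHAR NOT NULL,
    dep_filename VARCHAR NOT NULL,
    dep_type VARCHAR NOT NULL,
    PRIMARY KEY (rule_id, filename, dep_filename, dep_type)
);
-- Invented example data.
INSERT INTO cpg_pattern_deps VALUES
    ('sql-injection', 'app/views.py', 'app/db.py', 'call'),
    ('sql-injection', 'app/api.py', 'app/db.py', 'call'),
    ('weak-hash', 'app/auth.py', 'app/crypto.py', 'import');
""")

def stale_results(changed_file: str) -> list[tuple[str, str]]:
    """Hypothetical helper: (rule_id, filename) pairs to re-run after a change."""
    return con.execute("""
        SELECT DISTINCT rule_id, filename
        FROM cpg_pattern_deps
        WHERE dep_filename = ? OR filename = ?
        ORDER BY rule_id, filename
    """, (changed_file, changed_file)).fetchall()

stale = stale_results("app/db.py")
```

Changing `app/db.py` marks the `sql-injection` results for both dependent files as stale and leaves `weak-hash` untouched.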
Property Graph Definition¶
Using duckpgq to create a unified property graph with full CPG schema support.
KEY INSIGHT: DuckDB PGQ does not support views in VERTEX TABLES, but we can use materialized tables instead! This allows full support for polymorphic edges.
Step 1: Create Materialized Unified Nodes Table¶
Important: Use CREATE TABLE (not CREATE VIEW) to materialize the unified node set.
-- Create materialized unified table of all CPG nodes for polymorphic edge support
DROP TABLE IF EXISTS cpg_nodes;
CREATE TABLE cpg_nodes AS
SELECT id, 'FILE' as node_type FROM nodes_file
UNION ALL SELECT id, 'NAMESPACE_BLOCK' FROM nodes_namespace_block
UNION ALL SELECT id, 'METHOD' FROM nodes_method
UNION ALL SELECT id, 'METHOD_REF' FROM nodes_method_ref
UNION ALL SELECT id, 'CALL' FROM nodes_call
UNION ALL SELECT id, 'IDENTIFIER' FROM nodes_identifier
UNION ALL SELECT id, 'FIELD_IDENTIFIER' FROM nodes_field_identifier
UNION ALL SELECT id, 'LITERAL' FROM nodes_literal
UNION ALL SELECT id, 'LOCAL' FROM nodes_local
UNION ALL SELECT id, 'PARAM' FROM nodes_param
UNION ALL SELECT id, 'PARAM_OUT' FROM nodes_method_parameter_out
UNION ALL SELECT id, 'METHOD_RETURN' FROM nodes_method_return
UNION ALL SELECT id, 'RETURN' FROM nodes_return
UNION ALL SELECT id, 'BLOCK' FROM nodes_block
UNION ALL SELECT id, 'CONTROL_STRUCTURE' FROM nodes_control_structure
UNION ALL SELECT id, 'MEMBER' FROM nodes_member
UNION ALL SELECT id, 'TYPE_DECL' FROM nodes_type_decl
UNION ALL SELECT id, 'TYPE_REF' FROM nodes_type_ref
UNION ALL SELECT id, 'TYPE_PARAM' FROM nodes_type_param
UNION ALL SELECT id, 'TYPE_ARGUMENT' FROM nodes_type_argument
UNION ALL SELECT id, 'UNKNOWN' FROM nodes_unknown
UNION ALL SELECT id, 'JUMP_TARGET' FROM nodes_jump_target
UNION ALL SELECT id, 'BINDING' FROM nodes_binding
UNION ALL SELECT id, 'CLOSURE_BINDING' FROM nodes_closure_binding
UNION ALL SELECT id, 'COMMENT' FROM nodes_comment
UNION ALL SELECT id, 'COLLECTION_DECL' FROM nodes_collection_decl;
-- Add primary key and indexes
ALTER TABLE cpg_nodes ADD PRIMARY KEY (id);
CREATE INDEX idx_cpg_nodes_type ON cpg_nodes(node_type);
Step 2: Create Comprehensive Property Graph¶
Full Implementation (with ALL edge types including polymorphic edges):
CREATE PROPERTY GRAPH cpg
VERTEX TABLES (
-- Materialized unified node table for polymorphic edge support
cpg_nodes LABEL CPG_NODE,
-- Individual typed node tables for specific queries
nodes_file LABEL FILE_NODE,
nodes_namespace_block LABEL NAMESPACE_BLOCK,
nodes_method LABEL METHOD,
nodes_method_ref LABEL METHOD_REF,
nodes_call LABEL CALL_NODE,
nodes_identifier LABEL IDENTIFIER,
nodes_field_identifier LABEL FIELD_IDENTIFIER,
nodes_literal LABEL LITERAL,
nodes_local LABEL LOCAL,
nodes_param LABEL PARAM,
nodes_method_parameter_out LABEL PARAM_OUT,
nodes_method_return LABEL METHOD_RETURN,
nodes_return LABEL RETURN_NODE,
nodes_block LABEL BLOCK,
nodes_control_structure LABEL CONTROL_STRUCTURE,
nodes_member LABEL MEMBER,
nodes_type_decl LABEL TYPE_DECL,
nodes_type_ref LABEL TYPE_REF,
nodes_type_param LABEL TYPE_PARAM,
nodes_type_argument LABEL TYPE_ARGUMENT,
nodes_unknown LABEL UNKNOWN,
nodes_jump_target LABEL JUMP_TARGET,
nodes_binding LABEL BINDING,
nodes_closure_binding LABEL CLOSURE_BINDING,
nodes_comment LABEL COMMENT_NODE,
nodes_metadata LABEL METADATA
)
EDGE TABLES (
-- ========================================
-- POLYMORPHIC EDGES (via cpg_nodes table)
-- ========================================
edges_ast
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL AST,
edges_cfg
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL CFG,
edges_ref
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL REF,
edges_reaching_def
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL REACHING_DEF,
edges_argument
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL ARGUMENT,
edges_dominate
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL DOMINATE,
edges_post_dominate
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL POST_DOMINATE,
edges_cdg
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL CDG,
edges_binds
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL BINDS,
edges_binds_to
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL BINDS_TO,
-- ========================================
-- TYPED EDGES (specific source/destination)
-- ========================================
edges_call
SOURCE KEY (src) REFERENCES nodes_call (id)
DESTINATION KEY (dst) REFERENCES nodes_method (id)
LABEL CALLS,
edges_receiver
SOURCE KEY (src) REFERENCES nodes_call (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL RECEIVER,
edges_condition
SOURCE KEY (src) REFERENCES nodes_control_structure (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL CONDITION,
edges_source_file
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES nodes_file (id)
LABEL SOURCE_FILE,
edges_alias_of
SOURCE KEY (src) REFERENCES nodes_type_decl (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL ALIAS_OF,
edges_inherits_from
SOURCE KEY (src) REFERENCES nodes_type_decl (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL INHERITS_FROM,
edges_capture
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES nodes_closure_binding (id)
LABEL CAPTURE,
edges_documents
SOURCE KEY (src) REFERENCES cpg_nodes (id)
DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
LABEL DOCUMENTS
);
Key Features:
- ✅ Full CPG schema support with ALL edge types
- ✅ Polymorphic edges (AST, CFG, REF, etc.) via materialized cpg_nodes table
- ✅ Typed vertices for efficient targeted queries
- ✅ DuckDB PGQ compatible (uses tables, not views)
- ✅ 100% CPG spec v1.1 compliant
Example Queries¶
Standard SQL Query: Find all calls to a specific method¶
SELECT c.*, m.name, m.filename, m.line_number
FROM nodes_call c
JOIN edges_call ec ON c.id = ec.src
JOIN nodes_method m ON ec.dst = m.id
WHERE m.full_name = 'com.example.MyClass.myMethod:void()';
DuckDB PGQ Query: Find direct call chains (caller -> callee)¶
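-- CALLS edges run from a call site to its callee; the enclosing (caller) method
-- is recovered through nodes_call.containing_method_id.
SELECT caller.full_name AS caller_name, callee_name
FROM GRAPH_TABLE(cpg
MATCH (call:CALL_NODE)-[:CALLS]->(callee:METHOD)
COLUMNS (call.containing_method_id AS caller_id, callee.full_name AS callee_name)
)
JOIN nodes_method caller ON caller.id = caller_id
WHERE caller.name = 'main';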
DuckDB PGQ Query: Find methods and their AST children¶
SELECT *
FROM GRAPH_TABLE(cpg
MATCH (m:METHOD)-[:AST]->(child:CPG_NODE)
COLUMNS (m.full_name AS method, child.id AS child_id)
);
DuckDB PGQ Query: Data flow paths using REACHING_DEF¶
SELECT *
FROM GRAPH_TABLE(cpg
MATCH p = ANY SHORTEST (source:CPG_NODE)-[r:REACHING_DEF]->{1,5}(sink:CPG_NODE)
COLUMNS (source.id AS source_id, sink.id AS sink_id)
)
LIMIT 100;
DuckDB PGQ Query: CFG paths (control flow)¶
SELECT *
FROM GRAPH_TABLE(cpg
MATCH p = ANY SHORTEST (from_node:CPG_NODE)-[e:CFG]->{1,3}(to_node:CPG_NODE)
COLUMNS (from_node.id AS start_node, to_node.id AS end_node)
)
LIMIT 100;
DuckDB PGQ Query: Find all identifiers and their references¶
SELECT *
FROM GRAPH_TABLE(cpg
MATCH (id:IDENTIFIER)-[:REF]->(decl:CPG_NODE)
COLUMNS (id.name AS identifier_name, decl.id AS declaration_id)
);
DuckDB PGQ Query: Type hierarchy (inheritance)¶
SELECT *
FROM GRAPH_TABLE(cpg
MATCH (derived:TYPE_DECL)-[:INHERITS_FROM]->(base:CPG_NODE)
COLUMNS (derived.full_name AS derived_type, base.id AS base_type)
);
Combined Query: Methods with most incoming calls¶
SELECT full_name, COUNT(*) AS call_count
FROM GRAPH_TABLE(cpg
MATCH (c:CALL_NODE)-[:CALLS]->(m:METHOD)
COLUMNS (m.id AS method_id, m.full_name AS full_name)
)
GROUP BY method_id, full_name
ORDER BY call_count DESC
LIMIT 10;
Performance Considerations¶
- Batching: Use 10,000 row batches for INSERT operations
- Indexing: Create indexes after bulk load for faster imports
- Partitioning: Consider partitioning by filename for large codebases
- Compression: Use DuckDB’s automatic compression for TEXT columns
- Memory: Configure DuckDB memory limit based on CPG size
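The batching and indexing advice can be sketched as a loader that inserts rows in fixed-size batches and builds indexes only after the load. SQLite from Python's standard library stands in for DuckDB so the sketch is self-contained; the `batched` helper and the one-transaction-per-batch policy are assumptions, and the edge data is synthetic.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 10_000  # batch size recommended above

def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges_ast (src BIGINT NOT NULL, dst BIGINT NOT NULL)")

edges = ((i, i + 1) for i in range(25_000))  # synthetic AST edges
batch_sizes = []
for batch in batched(edges, BATCH_SIZE):
    with con:  # commits one transaction per batch
        con.executemany("INSERT INTO edges_ast VALUES (?, ?)", batch)
    batch_sizes.append(len(batch))

# Indexes are created after the bulk load, not before.
con.execute("CREATE INDEX idx_ast_src ON edges_ast(src)")
con.execute("CREATE INDEX idx_ast_dst ON edges_ast(dst)")

total = con.execute("SELECT COUNT(*) FROM edges_ast").fetchone()[0]
```

With 25,000 rows the loader runs three batches, the last one partial.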
Schema Version¶
- CPG Spec: v1.1
- DuckDB: 1.4.2+
- duckpgq: Latest stable
- Schema Version: 7.0 (Full schema.go alignment)
- Last Updated: 2026-02-28
Changelog¶
v7.0 (2026-02-28) - Full schema.go alignment¶
MAJOR UPDATE: All table definitions now match gocpg/pkg/storage/duckdb/schema.go exactly
New node tables:
- nodes_modifier - Access modifier nodes
- nodes_annotation - Method/class annotations
- nodes_type - Type instances
- nodes_jump_label - Jump labels (goto)
- nodes_tag_v2 - External context tags (replaces legacy nodes_tag)
- nodes_finding - Static analysis findings
- nodes_macro - C preprocessor macros
- nodes_macro_param - Macro parameters
- nodes_namespace - Namespace index nodes
- nodes_annotation_literal - Annotation literal values
- nodes_annotation_parameter - Formal annotation parameters
- nodes_annotation_parameter_assign - Annotation parameter assignments
- nodes_key_value_pair - Key-value pairs for findings
- nodes_location - Source code location summary
- nodes_config_file - Configuration file content
- nodes_import - Import/include statements
- nodes_closure_binding - Closure variable capture
New edge tables:
- edges_contains - Structural containment
- edges_eval_type - Type evaluation
- edges_tagged_by - Tag association
- edges_parameter_link - Parameter links
- edges_pdg - Program Dependence Graph
- edges_ddg - Data Dependency Graph
- edges_vtable - Virtual table entries
- edges_documents - Comment documentation links
New views:
- call_containment - Denormalized caller/callee view (11 columns)
- method_docstrings - Comment to method mapping
New state/domain/pattern tables:
- cpg_fqn_index, cpg_file_state, cpg_git_state, cpg_branch_state, cpg_submodule_state, export_progress
- cpg_domain_annotations, cpg_domain_subsystems, cpg_domain_subsystem_members
- finding_evidence, cpg_enrichment_state, cpg_enrichment_anchors
- cpg_pattern_results, cpg_pattern_rules, cpg_pattern_deps
Column changes:
- nodes_method: Added is_nested, loc, parameter_count, embedding*, ast_hash
- nodes_call: Added containing_method_id, callee_method_id, type_origin, type_confidence, embedding*
- nodes_param: Added method_id, parent_param_id, typedef_id, struct_member_id
- nodes_method_return: Added method_id
- nodes_comment: Added hash, line_number_end, column_number_end, containing_method_id, comment_type, documented_node_id, binding_type
- edges_call: Added cross_language, binding_type
- edges_argument: Added argument_index, argument_name
- edges_capture: Added variable_name
- edges_documents: Added binding_type
Table renames:
- nodes_type_parameter -> nodes_type_param (with new columns: constraint_type, index, method_id, type_decl_id)
- nodes_param_out -> nodes_method_parameter_out
v6.0 (2026-02-26) - Pre-computed Metrics & Pattern Flags¶
MAJOR UPDATE: Added pre-computed boolean flags and metrics to nodes_method for fast queries
New nodes_method Columns:
- has_disabled_code (BOOLEAN) — Detects #if 0, if false, commented-out blocks
- has_deprecated (BOOLEAN) — Detects @Deprecated, [[deprecated]] annotations
- has_todo_fixme (BOOLEAN) — Detects TODO, FIXME, HACK, XXX comments
- has_debug_code (BOOLEAN) — Detects debug prints, console.log, debugging statements
- is_test (BOOLEAN) — Cross-language test function detection (11 languages)
- is_entry_point (BOOLEAN) — Public API entry point detection (exported, HTTP handler, main)
- cyclomatic_complexity (INTEGER) — McCabe cyclomatic complexity metric
- fan_in (INTEGER) — Number of callers
- fan_out (INTEGER) — Number of callees
Impact:
- 10-100x faster pattern queries vs. LIKE scans on code/name columns
- Cross-language test file detection (pytest, JUnit, Go testing, etc.)
- Pre-computed metrics enable instant complexity/coupling analysis
- DuckDB 1.4.2+ required for DuckPGQ graph query extension
v5.0 (2025-11-16) - Complete Compliance¶
MAJOR UPDATE: Achieved 100% CPG schema compliance with all remaining features
New Node Types:
- nodes_unknown (UNKNOWN) - Catch-all for unsupported AST constructs
- nodes_jump_target (JUMP_TARGET) - Labels for goto/break/continue statements
- nodes_type_parameter (TYPE_PARAMETER) - Generic/template formal parameters
- nodes_type_argument (TYPE_ARGUMENT) - Generic/template actual arguments
- nodes_binding (BINDING) - Method resolution bindings
- nodes_closure_binding (CLOSURE_BINDING) - Variable capture in closures
- nodes_comment (COMMENT) - Source code comments
New Edge Types:
- edges_alias_of (ALIAS_OF) - Type alias relationships
- edges_inherits_from (INHERITS_FROM) - Type inheritance relationships
- edges_capture (CAPTURE) - Closure variable capture
- edges_captured_by (CAPTURED_BY) - Reverse closure capture
Impact:
- 100% CPG schema compliance achieved
- Complete AST coverage with UNKNOWN nodes
- Full generic/template support
- Closure and lambda analysis enabled
- Type alias and inheritance resolution
- Comment preservation for documentation
- Complete program analysis capabilities
Compliance: ~95% → 100% CPG schema
v4.0 (2025-11-16) - Namespace and File Support¶
MAJOR UPDATE: Added file mapping, namespace support, and method/type references
New Node Types:
- nodes_file (FILE) - Source file nodes for file-based indexing
- nodes_namespace_block (NAMESPACE_BLOCK) - Namespace blocks (C++ namespace, Java package)
- nodes_method_ref (METHOD_REF) - Method references (higher-order functions, function pointers)
- nodes_type_ref (TYPE_REF) - Type references (reflection, type casting, generics)
New Edge Types:
- edges_source_file (SOURCE_FILE) - Node to file mapping (auto-created from FILENAME)
Impact:
- File-based code navigation enabled
- Namespace-aware analysis supported
- Higher-order function tracking (callbacks, delegates)
- Type reflection and generics support
- Complete source file mapping (IDE integration ready)
- Cross-file dependency analysis improved
Compliance: ~90% → ~95% CPG schema compliance
v3.0 (2025-11-16) - OOP Support¶
MAJOR UPDATE: Added OOP analysis support and precise source mapping
New Node Types:
- nodes_field_identifier (FIELD_IDENTIFIER) - Field access nodes for OOP
- nodes_member (MEMBER) - Class/struct field declarations
New Edge Types:
- edges_binds (BINDS) - Name binding edges (import/using statements)
- edges_binds_to (BINDS_TO) - Reverse binding edges (name resolution)
New Properties:
- OFFSET, OFFSET_END: Precise byte-level source mapping (added to METHOD, TYPE_DECL, IDENTIFIER, FIELD_IDENTIFIER, MEMBER)
- MODIFIER: Access modifiers array (added to METHOD, TYPE_DECL)
- CANONICAL_NAME: Normalized identifier name (added to FIELD_IDENTIFIER)
Impact:
- OOP code analysis fully supported (field access tracking)
- Precise source code location mapping (byte-level)
- Visibility analysis enabled (PUBLIC, PRIVATE, STATIC, etc.)
- Variable/function name resolution improved
- Alias analysis enabled (canonical names)
Compliance: ~80% → ~90% CPG schema compliance
v2.0 (2025-11-16) - Critical Updates¶
MAJOR UPDATE: Added critical components for PDG and SSA analysis
New Node Types:
- nodes_param_out (METHOD_PARAMETER_OUT) - Output parameters for SSA analysis
- nodes_method_return (METHOD_RETURN) - Formal return parameter
New Edge Types:
- edges_cdg (Control Dependence Graph) - Critical for PDG!
Impact:
- PDG now complete (DDG + CDG)
- SSA analysis now possible
- Program slicing enabled
- Security taint analysis improved
Compliance: ~70% → ~80% CPG schema compliance
v1.0 (2025-11-15) - Initial Release¶
- 11 node types (METHOD, CALL, IDENTIFIER, LITERAL, LOCAL, PARAM, RETURN, BLOCK, CONTROL_STRUCTURE, TYPE_DECL, METADATA)
- 10 edge types (AST, CFG, CALL, REF, REACHING_DEF, ARGUMENT, RECEIVER, CONDITION, DOMINATE, POST_DOMINATE)
Extension: Semantic Tag System¶
Overview¶
The tag system extends the CPG with semantic annotations for methods and other code elements. Tags are custom extensions (not part of CPG spec v1.1) that enable semantic search, code classification, and intelligent analysis.
nodes_tag_v2¶
Stores tag definitions (semantic labels for code elements), together with external source tracking and a confidence score for each tag.
CREATE TABLE nodes_tag_v2 (
id BIGINT PRIMARY KEY,
name VARCHAR NOT NULL,
value VARCHAR,
external_source VARCHAR,
external_id VARCHAR,
external_url VARCHAR,
confidence DOUBLE DEFAULT 1.0,
metadata VARCHAR,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_tag_v2_name ON nodes_tag_v2(name);
CREATE INDEX idx_tag_v2_name_value ON nodes_tag_v2(name, value);
CREATE INDEX idx_tag_external ON nodes_tag_v2(external_source, external_id);
Properties:
- NAME: Tag category identifier (unique semantic dimension)
- VALUE: Tag value within that category
- EXTERNAL_SOURCE: Origin system for the tag (e.g., 'jira', 'sonarqube')
- EXTERNAL_ID: Identifier in the external system
- EXTERNAL_URL: Link to the external resource
- CONFIDENCE: Confidence score for the tag assignment (0.0-1.0, default 1.0)
- METADATA: Additional JSON metadata for the tag
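The external-source and confidence columns can be combined to pull out tags that may need review. The query below is a minimal sketch: the 0.8 threshold and the `reviewer` metadata key are hypothetical, and it assumes METADATA holds a JSON object and that DuckDB's JSON functions are available in the session.

```sql
-- Sketch: low-confidence tags imported from one external system.
-- Assumptions: the 0.8 cutoff and the '$.reviewer' key are illustrative only.
SELECT
    t.name,
    t.value,
    t.external_id,
    t.external_url,
    t.confidence,
    json_extract_string(t.metadata, '$.reviewer') AS reviewer  -- NULL if the key is absent
FROM nodes_tag_v2 t
WHERE t.external_source = 'sonarqube'
  AND t.confidence < 0.8
ORDER BY t.confidence ASC;
```

The `(external_source, external_id)` index defined above should help when the filter also pins down a specific external identifier.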
edges_tagged_by (Extension)¶
Connects code elements to their semantic tags.
CREATE TABLE edges_tagged_by (
src BIGINT NOT NULL, -- Source node id (typically METHOD)
dst BIGINT NOT NULL -- Tag node id (nodes_tag_v2)
);
CREATE INDEX idx_tagged_by_src ON edges_tagged_by(src);
CREATE INDEX idx_tagged_by_dst ON edges_tagged_by(dst);
Relationship:
- Source: Any CPG node (typically nodes_method)
- Destination: nodes_tag_v2 entry
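To make the relationship concrete, the following sketch creates one tag and attaches it to a method. It assumes both tables above already exist; the tag id `9000001`, the external id `HYPOTHETICAL-123`, and the method name `hypothetical_function` are placeholders, not values from a real database.

```sql
-- Sketch: define a tag, then link it to every method with a given full name.
-- All literal values below are hypothetical placeholders.
INSERT INTO nodes_tag_v2 (id, name, value, external_source, external_id, confidence)
VALUES (9000001, 'security-risk', 'critical', 'sonarqube', 'HYPOTHETICAL-123', 0.9);

INSERT INTO edges_tagged_by (src, dst)
SELECT m.id, 9000001
FROM nodes_method m
WHERE m.full_name = 'hypothetical_function';
```

Because edges_tagged_by declares no uniqueness constraint, running the second statement twice would create duplicate edges; a loader may want to guard against that with a `NOT EXISTS` check.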
Tag Categories¶
| Category | Description | Example Values |
|---|---|---|
| subsystem-name | Code organizational unit | 'executor', 'planner', 'parser', 'storage' |
| security-risk | Security classification | 'critical', 'high', 'medium', 'low' |
| taint-source | Untrusted data entry point | Boolean tagging |
| taint-sink | Security-sensitive output | Boolean tagging |
| perf-hotspot | Performance critical code | Boolean tagging |
| allocation-heavy | Memory-intensive methods | Boolean tagging |
| test-coverage | Has associated tests | Boolean tagging |
| cyclomatic-complexity | Code complexity metric | Numeric values (e.g., '15', '25') |
| function-purpose | Semantic description | Free-form text description |
| entry-point | System entry point | Boolean tagging |
Example Tag Queries¶
-- Find all security-critical methods
SELECT m.*
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'security-risk' AND t.value = 'critical';
-- List all subsystems
SELECT DISTINCT t.value as subsystem
FROM nodes_tag_v2 t
WHERE t.name = 'subsystem-name'
ORDER BY t.value;
-- Find methods in a specific subsystem
SELECT m.full_name, m.filename, m.line_number
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'subsystem-name' AND t.value = 'executor';
-- Get complexity hotspots
-- TRY_CAST returns NULL for non-numeric values instead of raising a conversion error
SELECT m.full_name, t.value as complexity
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'cyclomatic-complexity'
AND TRY_CAST(t.value AS INTEGER) > 20
ORDER BY TRY_CAST(t.value AS INTEGER) DESC;
-- Count tag assignments (tagged_by edges) by category
SELECT t.name, COUNT(*) as count
FROM nodes_tag_v2 t
JOIN edges_tagged_by e ON t.id = e.dst
GROUP BY t.name
ORDER BY count DESC;
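Boolean categories from the Tag Categories table can also be combined. The sketch below assumes that a Boolean tag is expressed purely by the presence of a tagged_by edge, whatever the tag's VALUE is; the source does not state how Boolean tags are encoded, so adjust the filter if a value such as 'true' is stored.

```sql
-- Sketch: methods that are both an entry point and a taint source.
-- Assumption: Boolean tags are indicated by edge presence, not by t.value.
SELECT m.id, m.full_name
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name IN ('entry-point', 'taint-source')
GROUP BY m.id, m.full_name
HAVING COUNT(DISTINCT t.name) = 2;  -- both categories must be present
```

Counting distinct category names, rather than rows, keeps the result correct even if a method carries duplicate edges to the same tag.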
Tag Statistics¶
Current database contains approximately:
- 15.68M tags across 98 categories
- Primary categories: subsystem-name, security-risk, function-purpose
- Tags enable semantic code search and intelligent analysis
Integration Notes¶
Tags are enriched post-CPG generation through:
1. Static analysis: CPG tag queries (cpg.method.tag.name(...))
2. LLM enrichment: AI-generated function purpose descriptions
3. Computed metrics: Cyclomatic complexity from CFG analysis
4. Manual annotation: Security audit findings
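An enrichment pass usually needs to know which methods still lack a given tag. As a hedged sketch of how that work list might be built, the query below selects methods with no function-purpose tag, which could then be handed to the LLM enrichment step. It is not the pipeline's actual implementation, only an anti-join over the tables defined above.

```sql
-- Sketch: methods that have not yet received a 'function-purpose' tag.
SELECT m.id, m.full_name
FROM nodes_method m
WHERE NOT EXISTS (
    SELECT 1
    FROM edges_tagged_by e
    JOIN nodes_tag_v2 t ON e.dst = t.id
    WHERE e.src = m.id
      AND t.name = 'function-purpose'
);
```

Swapping the category name gives the same work list for any other enrichment source, for example 'cyclomatic-complexity' for the computed-metrics step.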
Tags extend the CPG without modifying the core schema, maintaining compatibility with standard CPG tools while enabling advanced semantic analysis.
Last updated: 2026-02-28