DuckDB CPG Schema Design (CPG Spec v1.1)

Overview

This schema implements the Code Property Graph specification v1.1 in DuckDB using the duckpgq extension for efficient property graph queries.

Core Design Principles

  1. Node Tables: Separate tables for each major node type (METHOD, CALL, IDENTIFIER, etc.)
  2. Edge Tables: Separate tables for each edge type (AST, CFG, CALL, REF, REACHING_DEF, etc.)
  3. Property Graph: Use duckpgq’s CREATE PROPERTY GRAPH for unified graph queries
  4. Efficient Indexing: ART indexes (the index type DuckDB builds for CREATE INDEX) on id, full_name, and frequently queried properties
  5. Batch Processing: Support for large-scale CPG imports (50K+ methods)
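
The indexing principle could be applied along the following lines. This is only a sketch: the index names are hypothetical, and the choice of containing_method_id and callee_method_id as "frequently queried properties" is an assumption, not something this document prescribes.

```sql
-- Hypothetical index names; the columns come from the table definitions below.
CREATE INDEX IF NOT EXISTS idx_method_id        ON nodes_method (id);
CREATE INDEX IF NOT EXISTS idx_method_full_name ON nodes_method (full_name);
-- Assumed hot paths: resolving call sites to their caller and callee.
CREATE INDEX IF NOT EXISTS idx_call_containing  ON nodes_call (containing_method_id);
CREATE INDEX IF NOT EXISTS idx_call_callee      ON nodes_call (callee_method_id);
```

For large batch imports it is usually cheaper to create such indexes after the bulk load than before it.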

GoCPG vs Legacy Schema

GoCPG writes DuckDB databases natively and produces additional tables that are not present in legacy Joern exports:

  GoCPG-only table      Purpose
  --------------------  ---------------------------------------------
  nodes_import          Import/include statements with resolved paths
  nodes_finding         Static analysis findings (security, quality)
  nodes_macro           Preprocessor macros (C/C++)
  edges_ddg             Data Dependence Graph edges
  edges_pdg             Program Dependence Graph edges
  edges_contains        Containment edges (method→node)
  edges_parameter_link  Parameter to argument links
  edges_eval_type       Type evaluation edges

GoCPG also pre-computes cyclomatic_complexity on nodes_method and method_id on nodes_param.
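
One way to tell the two kinds of database apart is to check which of the GoCPG-only tables exist. The query below is a minimal sketch that relies on DuckDB's information_schema; treating an empty result as "legacy export" is an assumption, not a documented rule.

```sql
-- Lists the GoCPG-only tables that are present in the current database.
SELECT table_name
FROM information_schema.tables
WHERE table_name IN (
    'nodes_import', 'nodes_finding', 'nodes_macro',
    'edges_ddg', 'edges_pdg', 'edges_contains',
    'edges_parameter_link', 'edges_eval_type'
)
ORDER BY table_name;
```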

Note on cpg_nodes: The materialized cpg_nodes table (a UNION ALL of all node tables) exists only to back the DuckPGQ Property Graph definition. Do not query it directly in application code; use the specific nodes_* tables instead. GoCPG does not write cpg_nodes into its databases; the compatibility layer creates it on demand when Property Graph queries are needed.
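
To illustrate the idea, a materialization step might look roughly like the sketch below. The real column layout of cpg_nodes is not specified in this document, so the (id, label) shape and the restriction to three node tables are assumptions made for brevity.

```sql
-- Sketch only: assumes a minimal (id, label) layout for cpg_nodes.
-- A full version would add one SELECT per nodes_* table.
CREATE OR REPLACE TABLE cpg_nodes AS
SELECT id, 'METHOD'     AS label FROM nodes_method
UNION ALL
SELECT id, 'CALL'       AS label FROM nodes_call
UNION ALL
SELECT id, 'IDENTIFIER' AS label FROM nodes_identifier;
```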

Node Tables

nodes_method

Core table for function/method declarations.

CREATE TABLE nodes_method (
    id                      BIGINT NOT NULL,
    name                    VARCHAR NOT NULL,
    full_name               VARCHAR NOT NULL,
    signature               VARCHAR,
    filename                VARCHAR,
    line_number             INTEGER,
    line_number_end         INTEGER,
    column_number           INTEGER,
    column_number_end       INTEGER,
    code                    VARCHAR,
    is_external             BOOLEAN DEFAULT FALSE,
    ast_parent_type         VARCHAR,
    ast_parent_full_name    VARCHAR,
    loc                     INTEGER,
    parameter_count         INTEGER,
    -- Pre-computed pattern flags (v6.0)
    has_disabled_code       BOOLEAN DEFAULT FALSE,
    has_deprecated          BOOLEAN DEFAULT FALSE,
    has_todo_fixme          BOOLEAN DEFAULT FALSE,
    has_debug_code          BOOLEAN DEFAULT FALSE,
    -- Classification flags (v6.0)
    is_test                 BOOLEAN DEFAULT FALSE,
    is_entry_point          BOOLEAN DEFAULT FALSE,
    is_nested               BOOLEAN DEFAULT FALSE,
    -- Pre-computed metrics (v6.0)
    cyclomatic_complexity   INTEGER DEFAULT 0,
    fan_in                  INTEGER DEFAULT 0,
    fan_out                 INTEGER DEFAULT 0,
    -- Embedding (populated externally by ChromaDB import)
    embedding               FLOAT[],
    embedding_model         VARCHAR,
    embedding_updated_at    TIMESTAMP,
    -- AST hash (for incremental change detection)
    ast_hash                VARCHAR
);

Properties (from CPG spec):

  - FULL_NAME, NAME, SIGNATURE: Method identification
  - IS_EXTERNAL: Whether the method is defined outside the analyzed source (e.g., a library function)
  - AST_PARENT_FULL_NAME, AST_PARENT_TYPE: Type context
  - FILENAME, LINE_NUMBER, COLUMN_NUMBER: Source location
  - CODE: Method source code (truncated to 1000 chars)
  - LOC: Lines of code
  - PARAMETER_COUNT: Number of parameters
  - AST_HASH: SHA256 hash of AST structure (for incremental change detection)

Pre-computed Pattern Flags (v6.0, GoCPG):

  - HAS_DISABLED_CODE: Contains #if 0, if false, or commented-out blocks
  - HAS_DEPRECATED: Contains @Deprecated, [[deprecated]], or similar annotations
  - HAS_TODO_FIXME: Contains TODO, FIXME, HACK, or XXX comments
  - HAS_DEBUG_CODE: Contains debug prints, console.log, or debugging statements

Classification Flags (v6.0, GoCPG):

  - IS_TEST: Method identified as a test function (cross-language detection)
  - IS_ENTRY_POINT: Method is a public API entry point (exported, HTTP handler, main, etc.)
  - IS_NESTED: Method is defined inside another method (closure/inner function)

Pre-computed Metrics (v6.0, GoCPG):

  - CYCLOMATIC_COMPLEXITY: McCabe cyclomatic complexity metric
  - FAN_IN: Number of methods that call this method
  - FAN_OUT: Number of methods called by this method

Example queries using the pre-computed flags and metrics:

-- Find complex methods with high fan-out (potential god methods)
SELECT full_name, cyclomatic_complexity, fan_out
FROM nodes_method
WHERE cyclomatic_complexity > 20 AND fan_out > 15
ORDER BY cyclomatic_complexity DESC;

-- Find deprecated methods still being called
SELECT m.full_name, m.fan_in
FROM nodes_method m
WHERE m.has_deprecated = TRUE AND m.fan_in > 0;

-- Count test vs. production methods
SELECT is_test, COUNT(*) AS method_count
FROM nodes_method
GROUP BY is_test;

-- Find public API entry points with high complexity
SELECT full_name, cyclomatic_complexity, fan_in
FROM nodes_method
WHERE is_entry_point = TRUE AND cyclomatic_complexity > 10
ORDER BY cyclomatic_complexity DESC;

nodes_call

Represents function/method invocations.

CREATE TABLE nodes_call (
    id                      BIGINT NOT NULL,
    name                    VARCHAR,
    method_full_name        VARCHAR,
    signature               VARCHAR,
    dispatch_type           VARCHAR,
    code                    VARCHAR,
    line_number             INTEGER,
    column_number           INTEGER,
    argument_index          INTEGER,
    filename                VARCHAR,
    type_full_name          VARCHAR,
    containing_method_id    BIGINT,
    callee_method_id        BIGINT,
    type_origin             VARCHAR DEFAULT '',
    type_confidence         DOUBLE DEFAULT 0.0,
    embedding               FLOAT[],
    embedding_model         VARCHAR,
    embedding_updated_at    TIMESTAMP
);

Properties (from CPG spec):

  - METHOD_FULL_NAME: Target method
  - DISPATCH_TYPE: Call mechanism (STATIC_DISPATCH, DYNAMIC_DISPATCH)
  - TYPE_FULL_NAME: Return type
  - SIGNATURE: Parameter types
  - CONTAINING_METHOD_ID: ID of the method containing this call site
  - CALLEE_METHOD_ID: ID of the resolved callee method
  - TYPE_ORIGIN: Source of type inference (e.g., TypeRecoveryPass)
  - TYPE_CONFIDENCE: Confidence score for type inference (0.0-1.0)
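
Because each call site stores both containing_method_id and callee_method_id, caller/callee questions can be answered with plain joins against nodes_method. The sketch below assumes that an unresolved callee is stored as NULL; 'pkg.Target' is a placeholder name, not a method from any real codebase.

```sql
-- Who calls a given method? 'pkg.Target' is a placeholder full_name.
SELECT caller.full_name AS caller,
       c.code           AS call_site,
       c.filename,
       c.line_number
FROM nodes_call c
JOIN nodes_method callee ON callee.id = c.callee_method_id
JOIN nodes_method caller ON caller.id = c.containing_method_id
WHERE callee.full_name = 'pkg.Target'
ORDER BY caller.full_name, c.line_number;

-- Call sites whose callee was not resolved
-- (assumption: unresolved callees are stored as NULL).
SELECT name, method_full_name, dispatch_type, filename, line_number
FROM nodes_call
WHERE callee_method_id IS NULL;
```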

nodes_identifier

Variable and reference names.

CREATE TABLE nodes_identifier (
    id                      BIGINT NOT NULL,
    name                    VARCHAR NOT NULL,
    type_full_name          VARCHAR,
    code                    VARCHAR,
    order_index             INTEGER,
    argument_index          INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

Properties (from CPG spec):

  - NAME: Variable identifier
  - TYPE_FULL_NAME: Variable type
  - CONTAINING_METHOD_ID: ID of the containing method

nodes_field_identifier

Field access identifiers (OOP - e.g., obj.field).

CREATE TABLE nodes_field_identifier (
    id              BIGINT NOT NULL,
    canonical_name  VARCHAR NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    argument_index  INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties (from CPG spec):

  - CANONICAL_NAME: Normalized name (e.g., “myField” for both a.myField and b.myField)
  - CODE: Field access as written (e.g., “obj.field”)
  - Purpose: Identify field accesses in OOP code (critical for alias analysis)

Example:

struct Point { int x, y; };
Point p;
p.x = 10;  // <- "x" is FIELD_IDENTIFIER with canonical_name="x"

nodes_literal

Constant values.

CREATE TABLE nodes_literal (
    id                      BIGINT NOT NULL,
    code                    VARCHAR NOT NULL,
    type_full_name          VARCHAR,
    order_index             INTEGER,
    argument_index          INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

Properties (from CPG spec):

  - TYPE_FULL_NAME: Literal type
  - CODE: Literal value

nodes_local

Local variable declarations.

CREATE TABLE nodes_local (
    id                      BIGINT NOT NULL,
    name                    VARCHAR NOT NULL,
    code                    VARCHAR,
    type_full_name          VARCHAR,
    order_index             INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

Properties (from CPG spec):

  - NAME: Local variable name
  - TYPE_FULL_NAME: Declared type

nodes_param

Method parameters (formal parameters).

CREATE TABLE nodes_param (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    code                VARCHAR,
    type_full_name      VARCHAR,
    index               INTEGER,
    is_variadic         BOOLEAN DEFAULT FALSE,
    evaluation_strategy VARCHAR,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR,
    method_id           BIGINT,
    parent_param_id     BIGINT,
    typedef_id          BIGINT,
    struct_member_id    BIGINT
);

Properties (from CPG spec):

  - INDEX: Parameter position
  - IS_VARIADIC: Variable-length parameter
  - EVALUATION_STRATEGY: BY_VALUE, BY_REFERENCE, BY_SHARING
  - METHOD_ID: ID of the containing method
  - PARENT_PARAM_ID: ID of the parent parameter (for nested/destructured params)
  - TYPEDEF_ID: ID of the typedef node (for C typedef resolution)
  - STRUCT_MEMBER_ID: ID of the struct member node (for C struct member resolution)
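
Since method_id links each parameter to its method, a readable parameter list can be rebuilt without traversing AST edges. The query below is a sketch; the '?' placeholder for a missing type is an illustrative choice, and the column is written as "index" with quotes purely as a precaution.

```sql
-- Rebuild "name: type" parameter lists, ordered by parameter position.
SELECT m.full_name,
       string_agg(p.name || ': ' || coalesce(p.type_full_name, '?'), ', '
                  ORDER BY p."index") AS parameters
FROM nodes_method m
JOIN nodes_param p ON p.method_id = m.id
GROUP BY m.id, m.full_name
ORDER BY m.full_name;
```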

nodes_method_parameter_out

Method output parameters (for SSA/data flow analysis). Also used to represent Go's multiple return values.

CREATE TABLE nodes_method_parameter_out (
    id                  BIGINT NOT NULL,
    name                VARCHAR,
    type_full_name      VARCHAR,
    code                VARCHAR,
    index               INTEGER,
    evaluation_strategy VARCHAR,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR,
    method_id           BIGINT
);

Properties (from CPG spec):

  - Corresponds to METHOD_PARAMETER_IN for data flow
  - INDEX: Parameter position (matches input parameter)
  - EVALUATION_STRATEGY: BY_VALUE, BY_REFERENCE, BY_SHARING
  - METHOD_ID: ID of the containing method
  - Required for SSA (Static Single Assignment) analysis

nodes_method_return

Method return parameter (formal return).

CREATE TABLE nodes_method_return (
    id                  BIGINT NOT NULL,
    type_full_name      VARCHAR,
    code                VARCHAR,
    evaluation_strategy VARCHAR,
    order_index         INTEGER,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR,
    method_id           BIGINT
);

Properties (from CPG spec):

  - TYPE_FULL_NAME: Return type
  - CODE: Typically “RET” or empty
  - EVALUATION_STRATEGY: How return value is passed
  - METHOD_ID: ID of the containing method
  - One per method (formal return parameter, not return statement)

nodes_return

Return statements (actual return in code).

CREATE TABLE nodes_return (
    id                      BIGINT NOT NULL,
    code                    VARCHAR,
    order_index             INTEGER,
    argument_index          INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

Note: This is the RETURN statement node. Different from METHOD_RETURN which is the formal return parameter.
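
The difference between the two tables shows up when they are joined to the same method: nodes_method_return contributes the declared return type, while nodes_return contributes one row per return statement. A minimal sketch, assuming one METHOD_RETURN row per method as stated above:

```sql
-- Declared return type vs. number of return statements, per method.
SELECT m.full_name,
       mr.type_full_name AS return_type,
       COUNT(r.id)       AS return_statements
FROM nodes_method m
LEFT JOIN nodes_method_return mr ON mr.method_id = m.id
LEFT JOIN nodes_return r         ON r.containing_method_id = m.id
GROUP BY m.id, m.full_name, mr.type_full_name
ORDER BY return_statements DESC;
```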

nodes_block

Compound statements (code blocks).

CREATE TABLE nodes_block (
    id                      BIGINT NOT NULL,
    type_full_name          VARCHAR,
    code                    VARCHAR,
    order_index             INTEGER,
    argument_index          INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

nodes_control_structure

Control flow constructs (if, while, for, etc.).

CREATE TABLE nodes_control_structure (
    id                      BIGINT NOT NULL,
    control_structure_type  VARCHAR NOT NULL,
    parser_type_name        VARCHAR,
    code                    VARCHAR,
    order_index             INTEGER,
    argument_index          INTEGER,
    line_number             INTEGER,
    column_number           INTEGER,
    filename                VARCHAR,
    containing_method_id    BIGINT
);

Properties (from CPG spec):

  - CONTROL_STRUCTURE_TYPE: BREAK, CONTINUE, DO, WHILE, FOR, GOTO, IF, ELSE, TRY, THROW, SWITCH
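
As an illustration of how control_structure_type and containing_method_id combine, the following sketch counts control structures per method and type. It assumes the type values are stored exactly as listed above (upper case).

```sql
-- How many control structures of each type does each method contain?
SELECT m.full_name,
       cs.control_structure_type,
       COUNT(*) AS occurrences
FROM nodes_control_structure cs
JOIN nodes_method m ON m.id = cs.containing_method_id
GROUP BY m.id, m.full_name, cs.control_structure_type
ORDER BY occurrences DESC;

-- Methods that use goto at least once.
SELECT DISTINCT m.full_name
FROM nodes_control_structure cs
JOIN nodes_method m ON m.id = cs.containing_method_id
WHERE cs.control_structure_type = 'GOTO';
```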

nodes_member

Type members (fields of classes/structs).

CREATE TABLE nodes_member (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    code            VARCHAR,
    type_full_name  VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties (from CPG spec):

  - NAME: Member name (e.g., “x”, “y”)
  - TYPE_FULL_NAME: Member type (e.g., “int”, “std::string”)
  - AST_PARENT_FULL_NAME: Containing type (spec property; not stored as a column in this table)
  - Purpose: Represent fields/members of classes/structs

Example:

struct Point {
    int x;     // <- MEMBER: name="x", type_full_name="int"
    int y;     // <- MEMBER: name="y", type_full_name="int"
};

nodes_type_decl

Type declarations (classes, structs).

CREATE TABLE nodes_type_decl (
    id                              BIGINT NOT NULL,
    name                            VARCHAR NOT NULL,
    full_name                       VARCHAR NOT NULL,
    alias_type_full_name            VARCHAR,
    inherits_from_type_full_name    VARCHAR[],
    is_external                     BOOLEAN DEFAULT FALSE,
    filename                        VARCHAR,
    line_number                     INTEGER,
    ast_parent_type                 VARCHAR,
    ast_parent_full_name            VARCHAR,
    code                            VARCHAR
);

Properties (from CPG spec):

  - FULL_NAME, NAME: Type identification
  - IS_EXTERNAL: Whether the type is defined outside the analyzed source
  - INHERITS_FROM_TYPE_FULL_NAME: Base types (array)
  - ALIAS_TYPE_FULL_NAME: Type alias
  - AST_PARENT_TYPE, AST_PARENT_FULL_NAME: Parent type context
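
Because inherits_from_type_full_name is a VARCHAR[] column, inheritance queries use DuckDB's list functions rather than a join table. A short sketch; 'com.example.Animal' is a placeholder type name.

```sql
-- One row per (type, direct base type) pair.
SELECT full_name,
       UNNEST(inherits_from_type_full_name) AS base_type
FROM nodes_type_decl;

-- Direct subtypes of a placeholder base type.
SELECT full_name
FROM nodes_type_decl
WHERE list_contains(inherits_from_type_full_name, 'com.example.Animal');
```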

nodes_metadata

CPG metadata (required by spec).

CREATE TABLE nodes_metadata (
    id              BIGINT NOT NULL,
    language        VARCHAR NOT NULL,
    version         VARCHAR DEFAULT '1.1',
    root            VARCHAR,
    overlays        VARCHAR[],
    hash            VARCHAR
);

Properties (from CPG spec):

  - LANGUAGE: Source language
  - VERSION: CPG spec version (default “1.1”)
  - ROOT: Root path
  - OVERLAYS: Applied overlays
  - HASH: Content hash

nodes_file

Source file nodes (required by spec).

CREATE TABLE nodes_file (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    hash            VARCHAR,
    ast_hash        VARCHAR,
    content         VARCHAR,
    size_bytes      BIGINT,
    language        VARCHAR
);

Properties (from CPG spec):

  - NAME: File path relative to root (from METADATA.ROOT)
  - HASH: SHA-256 or MD5 hash of file content
  - AST_HASH: Hash of AST structure (for incremental change detection)
  - CONTENT: Optional full source code of the file
  - SIZE_BYTES: File size in bytes
  - LANGUAGE: Detected programming language

Purpose:

  - Index for looking up all code elements by file
  - Root nodes of Abstract Syntax Trees (AST)
  - Source file metadata storage
  - Required for SOURCE_FILE edges

Example:

name="src/main.c", hash="abc123...", order_index=0

Note: Each source file SHOULD have exactly one FILE node. FILE nodes serve as AST roots and allow navigation from file to all contained code elements.
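
The one-FILE-node-per-file expectation can be checked after an import. The first query below follows directly from the table definition; the second assumes that nodes_method.filename uses the same relative path as nodes_file.name, which this document does not state explicitly.

```sql
-- Files that violate the "exactly one FILE node" expectation.
SELECT name, COUNT(*) AS file_nodes
FROM nodes_file
GROUP BY name
HAVING COUNT(*) > 1;

-- Methods per file (assumption: nodes_method.filename matches nodes_file.name).
SELECT f.name, f.language, COUNT(m.id) AS methods
FROM nodes_file f
LEFT JOIN nodes_method m ON m.filename = f.name
GROUP BY f.name, f.language
ORDER BY methods DESC;
```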

nodes_namespace_block

Namespace block nodes (namespace scopes).

CREATE TABLE nodes_namespace_block (
    id          BIGINT NOT NULL,
    name        VARCHAR NOT NULL,
    full_name   VARCHAR NOT NULL,
    filename    VARCHAR,
    order_index INTEGER
);

Properties (from CPG spec):

  - NAME: Human-readable namespace name (e.g., “foo.bar”)
      - Dot-separated: “foo.bar” means namespace “bar” inside “foo”
  - FULL_NAME: Unique identifier combining file and namespace
      - Should include file info to ensure uniqueness
  - FILENAME: Source file containing this namespace block
  - ORDER_INDEX: Position in parent AST

Purpose:

  - Represent namespace blocks (C++ namespace{}, Java package)
  - Structure code into logical units
  - Allow namespace-based code queries
  - Support multi-file namespace analysis

Examples:

// C++:
namespace foo {
    namespace bar {
        // code
    }
}
// NAME="foo.bar", FULL_NAME="main.cpp:foo.bar"

// Java:
package com.example.myapp;
// NAME="com.example.myapp", FULL_NAME="Main.java:com.example.myapp"

Note: NAMESPACE nodes (indices) are auto-generated from NAMESPACE_BLOCK nodes when the CPG is loaded.

nodes_method_ref

Method reference nodes (method as value).

CREATE TABLE nodes_method_ref (
    id                  BIGINT NOT NULL,
    method_full_name    VARCHAR NOT NULL,
    code                VARCHAR,
    order_index         INTEGER,
    argument_index      INTEGER,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR
);

Properties (from CPG spec):

  - METHOD_FULL_NAME: Fully-qualified name of referenced method
  - TYPE_FULL_NAME: Type of the method (e.g., “int(*)(int, int)” in C); spec property, not stored as a column in this table
  - CODE: How the reference appears in source
  - OFFSET/OFFSET_END: Precise source location; spec properties, not stored as columns in this table

Purpose:

  - Represent methods passed as arguments (higher-order functions)
  - Function pointers (C/C++)
  - Lambda expressions / closures
  - Method handles (Java)
  - Delegate types (C#)

Examples:

// C function pointer:
int (*func_ptr)(int) = &myFunction;
// METHOD_REF: method_full_name="myFunction", type_full_name="int(*)(int)"

// Java method reference:
list.forEach(System.out::println);
// METHOD_REF: method_full_name="System.out.println", type_full_name="Consumer<Object>"

# Python:
callback = some_function
# METHOD_REF: method_full_name="some_function"

Note: METHOD_REF is used when a method is referenced but not called at that location.

nodes_type_ref

Type reference nodes (type as value).

CREATE TABLE nodes_type_ref (
    id              BIGINT NOT NULL,
    type_full_name  VARCHAR NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    argument_index  INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties (from CPG spec):

  - TYPE_FULL_NAME: Fully-qualified name of referenced type
  - CODE: How the reference appears in source
  - OFFSET/OFFSET_END: Precise source location; spec properties, not stored as columns in this table

Purpose:

  - Represent types used as values (not instantiations)
  - typeof/typeid operations
  - Type casting
  - Reflection (Java .class, C# typeof)
  - Type arguments to generics

Examples:

// Java reflection:
Class<?> clazz = String.class;
// TYPE_REF: type_full_name="java.lang.String"

// C++ type casting:
auto* ptr = static_cast<MyClass*>(obj);
// TYPE_REF: type_full_name="MyClass"

// Generic type argument:
List<Integer> list = new ArrayList<>();
// TYPE_REF: type_full_name="java.lang.Integer"

Note: TYPE_REF is used when a type is referenced as a value, not when creating an instance.

nodes_unknown

Unknown AST nodes (catch-all for unsupported constructs).

CREATE TABLE nodes_unknown (
    id                  BIGINT NOT NULL,
    parser_type_name    VARCHAR NOT NULL,
    code                VARCHAR,
    type_full_name      VARCHAR,
    order_index         INTEGER,
    argument_index      INTEGER,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR
);

Properties (from CPG spec):

  - PARSER_TYPE_NAME: Name of construct as emitted by parser
  - TYPE_FULL_NAME: Best-effort type inference
  - CODE: Source code representation

Purpose:

  - Include AST nodes not specified in CPG spec
  - Language-specific constructs
  - Experimental/proprietary language features
  - Maintain complete AST even for unsupported features

Examples:

# Python walrus operator (if not in spec):
if (n := len(items)) > 10:
    ...
# UNKNOWN: parser_type_name="NamedExpr", code="n := len(items)"

# Proprietary language extension:
@CustomDirective
# UNKNOWN: parser_type_name="CustomDirective"

Note: UNKNOWN should be used sparingly - prefer proper node types when available.

nodes_jump_target

Jump targets (labels for goto/break/continue).

CREATE TABLE nodes_jump_target (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    code            VARCHAR,
    argument_index  INTEGER,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties (from CPG spec):

  - NAME: Label name
  - PARSER_TYPE_NAME: Type of label construct (spec property; not stored as a column in this table)
  - CODE: Label as it appears in source

Purpose:

  - Represent jump targets (goto labels, case labels)
  - Support control flow analysis with jumps
  - Enable goto-based code analysis
  - Track switch case targets

Examples:

// C goto label:
error_handler:
    cleanup();
    return -1;
// JUMP_TARGET: name="error_handler", parser_type_name="Label"

// Switch case label:
switch (x) {
    case 42:  // JUMP_TARGET: name="case_42"
        break;
}

Note: Modern languages discourage goto, but it’s common in C/assembly.

nodes_type_param

Type parameters (generics/templates formal parameters).

CREATE TABLE nodes_type_param (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    constraint_type     VARCHAR,
    "index"             INTEGER,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR,
    method_id           BIGINT,
    type_decl_id        BIGINT
);

Properties (from CPG spec):

  - NAME: Type parameter name (e.g., “T”, “K”, “V”)
  - CONSTRAINT_TYPE: Upper bound constraint (e.g., “Comparable” for T extends Comparable)
  - INDEX: Position among sibling type parameters
  - METHOD_ID: Owning method (for method-level generics)
  - TYPE_DECL_ID: Owning type declaration (for class-level generics)

Purpose:

  - Formal type parameters in generic/template declarations
      - Java Generics: class List<T>
      - C++ Templates: template<typename T>
      - C# Generics: class Dictionary<TKey, TValue>

Examples:

// Java:
class Box<T> {  // TYPE_PARAMETER: name="T"
    T value;
}

// C++:
template<typename K, typename V>
// TYPE_PARAMETER: name="K"
// TYPE_PARAMETER: name="V"
class Map { ... }

Note: TYPE_PARAMETER is the formal parameter, TYPE_ARGUMENT is the actual type used.

nodes_type_argument

Type arguments (generics/templates actual arguments).

CREATE TABLE nodes_type_argument (
    id              BIGINT NOT NULL,
    code            VARCHAR,
    order_index     INTEGER
);

Properties (from CPG spec):

  - CODE: Type argument code (e.g., “Integer”, “String”)

Purpose:

  - Actual type arguments in generic/template instantiations
  - Connects to TYPE_PARAMETER via BINDS_TO edge
  - Java: in List<Integer>, “Integer” is the TYPE_ARGUMENT
  - C++: in vector<int>, “int” is the TYPE_ARGUMENT

Examples:

List<Integer> list = new ArrayList<Integer>();
// TYPE_ARGUMENT: code="Integer" (for List)
// TYPE_ARGUMENT: code="Integer" (for ArrayList)

Map<String, Integer> map;
// TYPE_ARGUMENT: code="String"
// TYPE_ARGUMENT: code="Integer"

Note: TYPE_ARGUMENT instances bind to TYPE_PARAMETER declarations.

nodes_binding

Name-signature bindings (method resolution).

CREATE TABLE nodes_binding (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    signature           VARCHAR,
    method_full_name    VARCHAR
);

Properties (from CPG spec):

  - NAME: Method name
  - SIGNATURE: Method signature
  - METHOD_FULL_NAME: Fully-qualified resolved method name

Purpose:

  - Resolve (name, signature) pairs at type declarations
  - Support virtual method dispatch
  - Enable polymorphism analysis
  - Connect TYPE_DECL to bound methods

Examples:

// Type declaration with methods:
class Animal {
    void speak() { }  // BINDING: name="speak", signature="void()"
}

class Dog extends Animal {
    @Override
    void speak() { }  // BINDING: name="speak", signature="void()"
}

// BINDING nodes allow resolving which speak() is called

Note: BINDING connects TYPE_DECL to METHOD via BINDS and REF edges.

nodes_closure_binding

Closure variable capture (lambda/closure bindings).

CREATE TABLE nodes_closure_binding (
    id                    BIGINT NOT NULL,
    closure_binding_id    VARCHAR,
    closure_original_name VARCHAR,
    evaluation_strategy   VARCHAR,
    filename              VARCHAR
);

Properties (from CPG spec):

  - CLOSURE_BINDING_ID: Unique identifier for this capture
  - CLOSURE_ORIGINAL_NAME: Name of the captured variable
  - EVALUATION_STRATEGY: How variable is captured

Purpose:

  - Represent variable capture in closures/lambdas
  - Connect captured LOCAL/PARAM to closure
  - Support closure analysis
  - Enable escape analysis

Examples:

function outer(x) {
    let y = 10;
    return function inner() {
        console.log(x + y);  // x and y are captured
    };
}
// CLOSURE_BINDING for x: closure_binding_id="outer.inner.x"
// CLOSURE_BINDING for y: closure_binding_id="outer.inner.y"

// Java lambda:
int multiplier = 2;
list.forEach(item -> System.out.println(item * multiplier));
// CLOSURE_BINDING for multiplier

Note: CLOSURE_BINDING connects to LOCAL via CAPTURED_BY and to METHOD_REF via CAPTURE.

nodes_comment

Source code comments.

CREATE TABLE nodes_comment (
    id                      BIGINT NOT NULL,
    code                    VARCHAR NOT NULL,
    filename                VARCHAR,
    hash                    VARCHAR,
    line_number             INTEGER,
    line_number_end         INTEGER,
    column_number           INTEGER,
    column_number_end       INTEGER,
    "offset"                INTEGER,
    "offset_end"            INTEGER,
    order_index             INTEGER,
    containing_method_id    BIGINT,
    comment_type            VARCHAR,
    documented_node_id      BIGINT,
    binding_type            VARCHAR
);

Properties (from CPG spec):

  - CODE: Comment text (including delimiters)
  - FILENAME: Source file containing comment
  - OFFSET/OFFSET_END: Precise location

Purpose:

  - Preserve source code comments
  - Documentation extraction
  - Code annotation analysis
  - Comment-based security markers

Examples:

// Single-line comment
// COMMENT: code="// Single-line comment"

/* Multi-line
   comment */
// COMMENT: code="/* Multi-line\n   comment */"

/** JavaDoc comment
  * @param x Parameter description
  */
// COMMENT: code="/** JavaDoc...*/"

Note: Comments are AST nodes connected to FILE via AST edges. COMMENT_TYPE classifies comments (TODO, FIXME, HACK, XXX, DOCSTRING). DOCUMENTED_NODE_ID links to the node this comment documents.
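
The comment_type and documented_node_id columns make marker comments easy to report on. The sketch below assumes comment_type holds the upper-case values listed in the note, and that documented_node_id may point at a method; the LEFT JOIN keeps comments that document other node kinds or nothing at all.

```sql
-- TODO-style comments, with the documented method where one is linked.
SELECT c.comment_type,
       c.filename,
       c.line_number,
       m.full_name AS documented_method,
       c.code
FROM nodes_comment c
LEFT JOIN nodes_method m ON m.id = c.documented_node_id
WHERE c.comment_type IN ('TODO', 'FIXME', 'HACK', 'XXX')
ORDER BY c.filename, c.line_number;
```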

nodes_modifier

Access modifiers (CPG schema compatibility).

CREATE TABLE nodes_modifier (
    id              BIGINT NOT NULL,
    modifier_type   VARCHAR NOT NULL,
    code            VARCHAR,
    order_index     INTEGER
);

Properties:

  - MODIFIER_TYPE: Modifier kind (STATIC, PUBLIC, PROTECTED, PRIVATE, ABSTRACT, NATIVE, CONSTRUCTOR, VIRTUAL, INTERNAL, FINAL, READONLY, MODULE)

nodes_annotation

Method/class annotations and decorators.

CREATE TABLE nodes_annotation (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    full_name       VARCHAR,
    code            VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties:

  - NAME: Annotation name (e.g., “Override”, “Deprecated”)
  - FULL_NAME: Fully-qualified annotation name

nodes_type

Type instances (CPG schema compatibility).

CREATE TABLE nodes_type (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    full_name           VARCHAR NOT NULL,
    type_decl_full_name VARCHAR
);

Properties:

  - NAME, FULL_NAME: Type identification
  - TYPE_DECL_FULL_NAME: Link to the corresponding TYPE_DECL

nodes_jump_label

Jump labels (for goto statements).

CREATE TABLE nodes_jump_label (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

nodes_tag_v2

External context tags (enrichment system).

CREATE TABLE nodes_tag_v2 (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    value           VARCHAR,
    external_source VARCHAR,
    external_id     VARCHAR,
    external_url    VARCHAR,
    confidence      FLOAT,
    metadata        JSON,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at      TIMESTAMP
);

Properties:

  - NAME: Tag category
  - VALUE: Tag value
  - EXTERNAL_SOURCE: Source system (e.g., “domain_config”, “vcs”, “enrichment”)
  - CONFIDENCE: Tag confidence score (0.0-1.0)

nodes_finding

Static analysis findings (security, quality).

CREATE TABLE nodes_finding (
    id          BIGINT NOT NULL,
    title       VARCHAR NOT NULL,
    description VARCHAR,
    severity    VARCHAR,
    category    VARCHAR,
    confidence  FLOAT DEFAULT 1.0,
    source      VARCHAR,
    rule_id     VARCHAR,
    status      VARCHAR DEFAULT 'open',
    metadata    JSON,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at  TIMESTAMP
);

Properties:

  - TITLE: Finding summary
  - SEVERITY: error, warning, info, hint
  - CATEGORY: Finding category (security, quality, performance)
  - SOURCE: Generator (e.g., “pattern_match”, “finding_generation”)
  - RULE_ID: Pattern rule ID (if generated by pattern scan)
  - STATUS: open, resolved, suppressed
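
A typical triage view groups open findings by severity and category. This is a sketch that assumes the severity and status values are stored in lower case exactly as listed above; the explicit CASE ordering is an illustrative choice so that errors sort first.

```sql
-- Open findings per severity and category, most severe first.
SELECT severity,
       category,
       COUNT(*) AS open_findings
FROM nodes_finding
WHERE status = 'open'
GROUP BY severity, category
ORDER BY CASE severity
             WHEN 'error'   THEN 0
             WHEN 'warning' THEN 1
             WHEN 'info'    THEN 2
             WHEN 'hint'    THEN 3
             ELSE 4
         END,
         open_findings DESC;
```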

nodes_macro

C preprocessor macros.

CREATE TABLE nodes_macro (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    body                VARCHAR,
    is_function_like    BOOLEAN DEFAULT FALSE,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR
);

Properties:

  - NAME: Macro name
  - BODY: Macro expansion body
  - IS_FUNCTION_LIKE: Whether the macro has parameters (e.g., #define MAX(a,b))

nodes_macro_param

Parameters of function-like macros.

CREATE TABLE nodes_macro_param (
    id                  BIGINT NOT NULL,
    name                VARCHAR NOT NULL,
    index_              INTEGER,
    line_number         INTEGER,
    column_number       INTEGER,
    filename            VARCHAR,
    macro_id            BIGINT
);

Properties:

  - NAME: Parameter name
  - INDEX_: Position in the macro parameter list
  - MACRO_ID: ID of the containing macro
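
Joining nodes_macro_param back to nodes_macro through macro_id is enough to rebuild a function-like macro's signature, for example MAX(a, b). A minimal sketch:

```sql
-- Rebuild "NAME(param, ...)" signatures for function-like macros.
SELECT mac.name || '(' || string_agg(mp.name, ', ' ORDER BY mp.index_) || ')'
           AS macro_signature,
       mac.body
FROM nodes_macro mac
JOIN nodes_macro_param mp ON mp.macro_id = mac.id
WHERE mac.is_function_like
GROUP BY mac.id, mac.name, mac.body
ORDER BY macro_signature;
```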

nodes_namespace

Namespace index nodes (distinct from NAMESPACE_BLOCK).

CREATE TABLE nodes_namespace (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Note: NAMESPACE is the index node, NAMESPACE_BLOCK is the scope. NAMESPACE nodes are auto-generated from NAMESPACE_BLOCK nodes.

nodes_annotation_literal

Literal values in annotations.

CREATE TABLE nodes_annotation_literal (
    id              BIGINT NOT NULL,
    name            VARCHAR,
    code            VARCHAR,
    order_index     INTEGER,
    argument_index  INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

nodes_annotation_parameter

Formal annotation parameters.

CREATE TABLE nodes_annotation_parameter (
    id              BIGINT NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

nodes_annotation_parameter_assign

Annotation argument-to-parameter mapping.

CREATE TABLE nodes_annotation_parameter_assign (
    id              BIGINT NOT NULL,
    code            VARCHAR,
    order_index     INTEGER,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

nodes_key_value_pair

Key-value pairs for findings.

CREATE TABLE nodes_key_value_pair (
    id              BIGINT NOT NULL,
    key             VARCHAR NOT NULL,
    value           VARCHAR
);

nodes_location

Source code location summary.

CREATE TABLE nodes_location (
    id                  BIGINT NOT NULL,
    class_name          VARCHAR,
    class_short_name    VARCHAR,
    method_short_name   VARCHAR,
    node_label          VARCHAR,
    package_name        VARCHAR,
    symbol              VARCHAR,
    filename            VARCHAR,
    line_number         INTEGER,
    method_full_name    VARCHAR
);

nodes_collection_decl

Named collection declarations (dict, list, set, tuple, enum).

CREATE TABLE nodes_collection_decl (
    id                BIGINT NOT NULL,
    parent_id         BIGINT,
    class_id          BIGINT,
    name              VARCHAR,
    full_name         VARCHAR,
    collection_type   VARCHAR,
    element_count     INTEGER,
    filename          VARCHAR,
    line_number       INTEGER,
    keys_json         VARCHAR,
    value_type_hint   VARCHAR
);

Properties:
- NAME: Variable name (e.g. CWE_DATABASE)
- FULL_NAME: FQN (e.g. src/security/kb.py:CWE_DATABASE)
- COLLECTION_TYPE: dict, list, set, tuple, enum
- ELEMENT_COUNT: Number of top-level elements
- KEYS_JSON: JSON array of dict keys (max 256, dict only)
- VALUE_TYPE_HINT: Inferred value type if homogeneous

Currently supported: Python frontend only. Support for JS/TS/Go/Java is planned.
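
Because keys_json is stored as a VARCHAR holding a JSON array, it can be inspected with DuckDB's JSON functions. The following is a sketch that assumes the JSON extension is loaded; it compares the number of recorded keys with element_count for the largest dict declarations.

```sql
-- Sketch: largest dict declarations and how many of their keys were recorded.
-- keys_json is capped at 256 keys, so recorded_keys can be lower than element_count.
SELECT
    full_name,
    element_count,
    json_array_length(CAST(keys_json AS JSON)) AS recorded_keys
FROM nodes_collection_decl
WHERE collection_type = 'dict'
  AND keys_json IS NOT NULL
ORDER BY element_count DESC
LIMIT 20;
```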

nodes_config_file

Configuration file content.

CREATE TABLE nodes_config_file (
    id              BIGINT NOT NULL,
    name            VARCHAR NOT NULL,
    content         VARCHAR
);

nodes_import

Import/include statements.

CREATE TABLE nodes_import (
    id              BIGINT NOT NULL,
    imported_entity VARCHAR NOT NULL,
    imported_as     VARCHAR,
    is_wildcard     BOOLEAN DEFAULT FALSE,
    is_explicit     BOOLEAN DEFAULT TRUE,
    code            VARCHAR,
    line_number     INTEGER,
    column_number   INTEGER,
    filename        VARCHAR
);

Properties:
- IMPORTED_ENTITY: What is imported (e.g., "fmt", "os.path", "stdio.h")
- IMPORTED_AS: Alias (e.g., import f "fmt" -> imported_as="f")
- IS_WILDCARD: Wildcard import (e.g., import . "pkg", from os import *)
- IS_EXPLICIT: Explicit import (vs. implicit)

Examples:

Go:      import "fmt"                  -> imported_entity="fmt"
Go:      import f "fmt"               -> imported_entity="fmt", imported_as="f"
C:       #include <stdio.h>           -> imported_entity="stdio.h"
Python:  from os import path           -> imported_entity="os.path"
JS:      import { foo } from 'bar'     -> imported_entity="bar"
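
Two typical lookups against nodes_import, written as a sketch that uses only the columns defined above. The entity name 'os.path' is taken from the examples and serves as a placeholder.

```sql
-- Files ranked by number of wildcard imports.
SELECT filename, COUNT(*) AS wildcard_imports
FROM nodes_import
WHERE is_wildcard
GROUP BY filename
ORDER BY wildcard_imports DESC;

-- Where a given entity is imported, and under which alias (NULL if none).
SELECT filename, line_number, imported_as
FROM nodes_import
WHERE imported_entity = 'os.path'
ORDER BY filename, line_number;
```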

Edge Tables

edges_ast

Abstract Syntax Tree edges (parent-child relationships).

CREATE TABLE edges_ast (
    src BIGINT NOT NULL,
    dst BIGINT NOT NULL
);

CREATE INDEX idx_ast_src ON edges_ast(src);
CREATE INDEX idx_ast_dst ON edges_ast(dst);
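
Since AST edges form a tree, a recursive CTE over edges_ast can collect the full subtree below any node. The sketch below starts from a method; the full_name value is a placeholder borrowed from the example queries.

```sql
-- Sketch: all AST descendants of one method, with their depth below it.
WITH RECURSIVE subtree(id, depth) AS (
    SELECT m.id, 0
    FROM nodes_method m
    WHERE m.full_name = 'com.example.MyClass.myMethod:void()'  -- placeholder
    UNION ALL
    SELECT e.dst, s.depth + 1
    FROM subtree s
    JOIN edges_ast e ON e.src = s.id
)
SELECT id, depth
FROM subtree
ORDER BY depth, id;
```

The idx_ast_src index serves the e.src = s.id lookup at each level of the recursion.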

edges_cfg

Control Flow Graph edges.

CREATE TABLE edges_cfg (
    src BIGINT NOT NULL,
    dst BIGINT NOT NULL
);

CREATE INDEX idx_cfg_src ON edges_cfg(src);
CREATE INDEX idx_cfg_dst ON edges_cfg(dst);

edges_call

Call site to method declaration edges.

CREATE TABLE edges_call (
    src              BIGINT NOT NULL,
    dst              BIGINT NOT NULL,
    cross_language   BOOLEAN DEFAULT FALSE,
    binding_type     VARCHAR DEFAULT ''
);

CREATE INDEX idx_call_edge_src ON edges_call(src);
CREATE INDEX idx_call_edge_dst ON edges_call(dst);

Properties:
- CROSS_LANGUAGE: Whether this is a cross-language call (e.g., CGO Go->C, ctypes Python->C)
- BINDING_TYPE: How the call was resolved (e.g., "exact", "name_only", "import_path")

edges_ref

Reference edges (identifier to declaration).

CREATE TABLE edges_ref (
    src BIGINT NOT NULL, -- IDENTIFIER/CALL node id
    dst BIGINT NOT NULL  -- DECLARATION node id (LOCAL, PARAM, METHOD, TYPE_DECL)
);

CREATE INDEX idx_ref_src ON edges_ref(src);
CREATE INDEX idx_ref_dst ON edges_ref(dst);

edges_reaching_def

Data flow edges (reaching definitions).

CREATE TABLE edges_reaching_def (
    src BIGINT NOT NULL,
    dst BIGINT NOT NULL,
    variable VARCHAR  -- Variable name
);

CREATE INDEX idx_reaching_def_src ON edges_reaching_def(src);
CREATE INDEX idx_reaching_def_dst ON edges_reaching_def(dst);
CREATE INDEX idx_reaching_def_variable ON edges_reaching_def(variable);

Properties (from CPG spec):
- VARIABLE: Variable name being tracked

edges_argument

Argument edges (call to argument expressions, return to returned expression).

CREATE TABLE edges_argument (
    src             BIGINT NOT NULL,
    dst             BIGINT NOT NULL,
    argument_index  INTEGER,
    argument_name   VARCHAR
);

CREATE INDEX idx_argument_src ON edges_argument(src);
CREATE INDEX idx_argument_dst ON edges_argument(dst);

Properties:
- ARGUMENT_INDEX: Position of argument in argument list
- ARGUMENT_NAME: Named argument name (for languages with keyword arguments)
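
Combining edges_argument with edges_call gives the ordered arguments of every call site that resolves to a given method. This is a sketch; the method name 'memcpy' is a hypothetical example, not something taken from the schema.

```sql
-- Sketch: arguments of all resolved calls to one method, in positional order.
SELECT
    a.src            AS call_id,
    a.argument_index,
    a.argument_name,  -- NULL for positional arguments
    a.dst            AS argument_node_id
FROM edges_argument a
JOIN edges_call ec   ON ec.src = a.src
JOIN nodes_method m  ON m.id = ec.dst
WHERE m.name = 'memcpy'  -- hypothetical method name
ORDER BY a.src, a.argument_index;
```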

edges_receiver

Receiver edges (call to receiver object).

CREATE TABLE edges_receiver (
    src BIGINT NOT NULL, -- CALL node id
    dst BIGINT NOT NULL  -- Receiver expression id
);

edges_condition

Condition edges (control structure to conditional expression).

CREATE TABLE edges_condition (
    src BIGINT NOT NULL, -- CONTROL_STRUCTURE node id
    dst BIGINT NOT NULL  -- Expression node id
);

edges_dominate

Immediate dominator edges (control flow domination).

CREATE TABLE edges_dominate (
    src BIGINT NOT NULL,
    dst BIGINT NOT NULL
);

CREATE INDEX idx_dominate_src ON edges_dominate(src);
CREATE INDEX idx_dominate_dst ON edges_dominate(dst);

edges_post_dominate

Post-dominator edges.

CREATE TABLE edges_post_dominate (
    src BIGINT NOT NULL,
    dst BIGINT NOT NULL
);

CREATE INDEX idx_post_dominate_src ON edges_post_dominate(src);
CREATE INDEX idx_post_dominate_dst ON edges_post_dominate(dst);

edges_cdg

Control Dependence Graph edges (CRITICAL for PDG).

CREATE TABLE edges_cdg (
    src BIGINT NOT NULL, -- Control structure node id (condition/branch)
    dst BIGINT NOT NULL  -- Dependent node id (code that depends on condition)
);

Properties (from CPG spec):
- CDG edge means: dst is control-dependent on src
- Essential for Program Dependence Graph (PDG = DDG + CDG)
- Used for program slicing, security analysis, compiler optimizations
- Example: statements inside IF block are control-dependent on IF condition
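
The PDG = DDG + CDG relationship can be used directly for backward slicing. The sketch below assumes edges_ddg (defined later in this document) is populated; the node id 12345 is a placeholder for the slicing criterion.

```sql
-- Sketch: backward slice = every node the criterion depends on,
-- following control (CDG) and data (DDG) dependence edges in reverse.
WITH RECURSIVE
dep(src, dst) AS (
    SELECT src, dst FROM edges_cdg
    UNION
    SELECT src, dst FROM edges_ddg
),
slice(id) AS (
    SELECT CAST(12345 AS BIGINT)   -- placeholder: slicing criterion node id
    UNION                          -- UNION (not UNION ALL) deduplicates, so cycles terminate
    SELECT d.src
    FROM slice s
    JOIN dep d ON d.dst = s.id
)
SELECT id FROM slice;
```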

edges_binds

Binding edges (name bindings).

CREATE TABLE edges_binds (
    src BIGINT NOT NULL, -- BINDING node id
    dst BIGINT NOT NULL  -- METHOD or TYPE_DECL node id
);

Properties (from CPG spec):
- Connects BINDING nodes to their declarations
- Used for variable/function name resolution
- Example: import statement binds name to actual definition

edges_binds_to

Reverse binding edges (name uses).

CREATE TABLE edges_binds_to (
    src BIGINT NOT NULL, -- Variable/function reference node id
    dst BIGINT NOT NULL  -- BINDING node id
);

Properties (from CPG spec):
- Reverse of BINDS edge
- Connects uses of names to their bindings
- Example: variable reference → binding → declaration

BINDS workflow:

Declaration (METHOD/TYPE_DECL)
     ↑
   BINDS
     |
  BINDING node (import/using statement)
     ↑
 BINDS_TO
     |
Reference (IDENTIFIER/CALL)
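
The two-hop path in the diagram corresponds to a single join between the two edge tables. A minimal sketch:

```sql
-- Sketch: resolve each reference to its declaration through the BINDING node.
SELECT
    bt.src AS reference_id,    -- IDENTIFIER/CALL
    b.src  AS binding_id,      -- BINDING
    b.dst  AS declaration_id   -- METHOD or TYPE_DECL
FROM edges_binds_to bt
JOIN edges_binds b ON b.src = bt.dst;
```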

edges_source_file

Source file edges (node to file mapping).

CREATE TABLE edges_source_file (
    src BIGINT NOT NULL,  -- Any AST node id
    dst BIGINT NOT NULL   -- FILE node id
);

CREATE INDEX idx_source_file_src ON edges_source_file(src);
CREATE INDEX idx_source_file_dst ON edges_source_file(dst);

Properties (from CPG spec):
- Connects nodes to their source FILE
- Auto-created based on FILENAME properties
- MUST NOT be created by the language frontend; the edges are created automatically
- Many-to-one relationship: each node has exactly one source file, while a file can contain many nodes

Purpose:
- Map any code element back to its source file
- Navigate from FILE to all contained elements
- Support file-based queries and analysis
- Enable IDE "go to file" functionality

Example:

METHOD node (id=100) --SOURCE_FILE--> FILE node (id=1, name="main.c")
CALL node (id=200) --SOURCE_FILE--> FILE node (id=1, name="main.c")
TYPE_DECL node (id=300) --SOURCE_FILE--> FILE node (id=2, name="types.h")

Auto-creation logic:
  1. Frontend sets FILENAME property on nodes (METHOD, TYPE_DECL, etc.)
  2. CPG loader creates FILE nodes for unique filenames
  3. CPG loader creates SOURCE_FILE edges from nodes to FILE nodes
  4. Results in complete file→code mapping
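
Step 3 of the auto-creation logic can be expressed as a set-based insert. This is only a sketch for METHOD nodes, assuming nodes_file.name holds the same path string as nodes_method.filename (as the call_containment view also assumes); the actual loader may work differently.

```sql
-- Sketch of step 3: link each method to the FILE node with a matching name,
-- skipping methods that already have a SOURCE_FILE edge.
INSERT INTO edges_source_file (src, dst)
SELECT m.id, f.id
FROM nodes_method m
JOIN nodes_file f ON f.name = m.filename
WHERE NOT EXISTS (
    SELECT 1 FROM edges_source_file e WHERE e.src = m.id
);
```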

edges_alias_of

Type alias edges.

CREATE TABLE edges_alias_of (
    src BIGINT,  -- TYPE_DECL node (alias)
    dst BIGINT   -- TYPE node (actual type)
);

Properties (from CPG spec):
- Connects TYPE_DECL (alias) to TYPE (actual)
- MUST NOT be created by frontend; auto-created from ALIAS_TYPE_FULL_NAME
- One-to-one relationship

Purpose:
- Represent type aliases (C typedef, using, type aliases)
- Enable alias resolution
- Support type synonym analysis

Examples:

// C typedef:
typedef int Integer;
// TYPE_DECL "Integer" --ALIAS_OF--> TYPE "int"

// C++ using:
using String = std::string;
// TYPE_DECL "String" --ALIAS_OF--> TYPE "std::string"

// Rust type alias:
type Result<T> = std::result::Result<T, Error>;
// TYPE_DECL "Result" --ALIAS_OF--> TYPE "std::result::Result"

Note: Auto-generated when CPG is loaded based on ALIAS_TYPE_FULL_NAME property.

edges_inherits_from

Type inheritance edges.

CREATE TABLE edges_inherits_from (
    src BIGINT,  -- TYPE_DECL node (derived)
    dst BIGINT   -- TYPE node (base)
);

CREATE INDEX idx_inherits_from_src ON edges_inherits_from(src);
CREATE INDEX idx_inherits_from_dst ON edges_inherits_from(dst);

Properties (from CPG spec):
- Connects TYPE_DECL (derived) to TYPE (base)
- MUST NOT be created by frontend; auto-created from INHERITS_FROM_TYPE_FULL_NAME
- One-to-many relationship (multiple inheritance supported)

Purpose:
- Represent class/interface inheritance
- Enable polymorphism analysis
- Support type hierarchy queries
- Track inheritance chains

Examples:

// Java class inheritance plus interface implementation:
class Dog extends Animal implements Comparable {
    ...
}
// TYPE_DECL "Dog" --INHERITS_FROM--> TYPE "Animal"
// TYPE_DECL "Dog" --INHERITS_FROM--> TYPE "Comparable"

// C++ multiple inheritance:
class D : public A, public B { };
// TYPE_DECL "D" --INHERITS_FROM--> TYPE "A"
// TYPE_DECL "D" --INHERITS_FROM--> TYPE "B"

Note: Auto-generated when CPG is loaded based on INHERITS_FROM_TYPE_FULL_NAME array.

edges_capture

Closure capture edges.

CREATE TABLE edges_capture (
    src             BIGINT NOT NULL,
    dst             BIGINT NOT NULL,
    variable_name   VARCHAR
);

Properties:
- VARIABLE_NAME: Name of the captured variable

Properties (from CPG spec):
- Connects METHOD_REF/TYPE_REF to CLOSURE_BINDING
- Represents variable capture in closure/lambda
- One-to-many relationship (closure can capture multiple variables)

Purpose:
- Track which variables are captured by closures
- Enable escape analysis
- Support closure optimization
- Identify captured variable lifetimes

Examples:

function outer() {
    let x = 10;
    let y = 20;
    return function inner() {
        return x + y;  // captures x and y
    };
}
// METHOD_REF "inner" --CAPTURE--> CLOSURE_BINDING for x
// METHOD_REF "inner" --CAPTURE--> CLOSURE_BINDING for y

Note: CAPTURE edge connects closure to its captured variables.

edges_contains

Structural containment edges (method to node).

CREATE TABLE edges_contains (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

CREATE INDEX idx_contains_src ON edges_contains(src);
CREATE INDEX idx_contains_dst ON edges_contains(dst);

Properties:
- Connects methods/files to their contained nodes
- Used for structural queries (e.g., "find all calls in method X")
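
The "find all calls in method X" query mentioned above could look like the following sketch. The full_name value is a placeholder.

```sql
-- Sketch: all call sites contained in one method, via CONTAINS edges.
SELECT c.id AS call_id, c.line_number
FROM nodes_method m
JOIN edges_contains ec ON ec.src = m.id
JOIN nodes_call c      ON c.id = ec.dst
WHERE m.full_name = 'com.example.MyClass.myMethod:void()'  -- placeholder
ORDER BY c.line_number;
```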

edges_eval_type

Type evaluation edges (node to type).

CREATE TABLE edges_eval_type (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

CREATE INDEX idx_eval_type_src ON edges_eval_type(src);
CREATE INDEX idx_eval_type_dst ON edges_eval_type(dst);

edges_tagged_by

Tag association edges (node to tag).

CREATE TABLE edges_tagged_by (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

CREATE INDEX idx_tagged_by_src ON edges_tagged_by(src);
CREATE INDEX idx_tagged_by_dst ON edges_tagged_by(dst);

edges_parameter_link

Parameter link edges (input to output param).

CREATE TABLE edges_parameter_link (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

edges_pdg

Program Dependence Graph edges (CDG + data dependencies).

CREATE TABLE edges_pdg (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

edges_ddg

Data Dependency Graph edges (CodeGraph compatibility).

CREATE TABLE edges_ddg (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

edges_vtable

Virtual table edges (virtual method dispatch).

CREATE TABLE edges_vtable (
    src     BIGINT NOT NULL,
    dst     BIGINT NOT NULL
);

edges_documents

Comment to documented node edges.

CREATE TABLE edges_documents (
    src             BIGINT NOT NULL,
    dst             BIGINT NOT NULL,
    binding_type    VARCHAR
);

Properties:
- BINDING_TYPE: How the comment was linked to the node (e.g., "proximity", "docstring")
- Connects nodes_comment to the documented node (method, type, etc.)

Views

call_containment

Denormalized caller/callee view used by Python CodeGraph for architecture, retrieval, and workflow queries.

CREATE VIEW call_containment AS
SELECT DISTINCT
    caller.name AS containing_method_name,
    caller.full_name AS containing_method_full_name,
    callee.name AS callee_name,
    callee.full_name AS callee_full_name,
    caller.filename AS caller_filename,
    callee.filename AS callee_filename,
    nc.line_number AS call_line_number,
    ec.cross_language AS cross_language,
    ec.binding_type AS binding_type,
    caller_file.language AS caller_language,
    callee_file.language AS callee_language
FROM edges_call ec
JOIN nodes_call nc ON ec.src = nc.id
JOIN nodes_method caller ON nc.containing_method_id = caller.id
JOIN nodes_method callee ON ec.dst = callee.id
LEFT JOIN nodes_file caller_file ON caller.filename = caller_file.name
LEFT JOIN nodes_file callee_file ON callee.filename = callee_file.name;

Columns (11):
- containing_method_name, containing_method_full_name: Caller method
- callee_name, callee_full_name: Callee method
- caller_filename, callee_filename: Source files
- call_line_number: Line of the call site
- cross_language: Whether call crosses language boundary
- binding_type: How the call was resolved
- caller_language, callee_language: Languages of caller and callee
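
As an illustration of how the view might be used, the sketch below summarizes cross-language calls by language pair and resolution strategy. Because the view is defined with SELECT DISTINCT, the counts refer to distinct view rows.

```sql
-- Sketch: cross-language call summary by language pair and binding type.
SELECT
    caller_language,
    callee_language,
    binding_type,
    COUNT(*) AS call_rows
FROM call_containment
WHERE cross_language
GROUP BY caller_language, callee_language, binding_type
ORDER BY call_rows DESC;
```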

method_docstrings

Maps comments to documented methods (used for function description extraction).

CREATE VIEW method_docstrings AS
SELECT
    m.id AS method_id,
    m.name AS method_name,
    m.full_name,
    m.filename,
    m.line_number AS method_line,
    c.code AS docstring,
    c.line_number AS comment_line
FROM edges_documents ed
JOIN nodes_comment c ON ed.src = c.id
JOIN nodes_method m ON ed.dst = m.id;
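
One possible use of this view is finding methods that have no linked comment at all. A minimal sketch using an anti-join:

```sql
-- Sketch: methods without any DOCUMENTS edge from a comment.
SELECT m.full_name, m.filename, m.line_number
FROM nodes_method m
LEFT JOIN method_docstrings d ON d.method_id = m.id
WHERE d.method_id IS NULL
ORDER BY m.filename, m.line_number;
```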

State Tables

Tables for incremental updates, FQN index, and git tracking.

cpg_fqn_index

Fast symbol resolution index (inspired by Yandex SourceCraft).

CREATE TABLE cpg_fqn_index (
    fqn             VARCHAR NOT NULL,
    node_id         BIGINT NOT NULL,
    node_type       VARCHAR NOT NULL,
    name            VARCHAR NOT NULL,
    signature       VARCHAR,
    filename        VARCHAR NOT NULL,
    visibility      VARCHAR DEFAULT 'public',
    param_count     INTEGER DEFAULT 0,
    return_type     VARCHAR,
    PRIMARY KEY (fqn, node_type)
);

cpg_file_state

Track file state for incremental updates.

CREATE TABLE cpg_file_state (
    filename    VARCHAR PRIMARY KEY,
    hash        VARCHAR NOT NULL,
    ast_hash    VARCHAR,
    mtime       BIGINT,
    parsed_at   BIGINT DEFAULT (epoch_ms(now())),
    branch_name VARCHAR
);
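
Since filename is the primary key, recording a re-parsed file can be done as an upsert. This is a sketch: the path, hash, mtime, and branch values are placeholders, and the parsed_at expression mirrors the column default above.

```sql
-- Sketch: insert or refresh the state row for one file after parsing it.
INSERT INTO cpg_file_state (filename, hash, mtime, branch_name)
VALUES ('src/main.c', 'abc123', 1700000000000, 'main')  -- placeholder values
ON CONFLICT (filename) DO UPDATE SET
    hash        = EXCLUDED.hash,
    mtime       = EXCLUDED.mtime,
    branch_name = EXCLUDED.branch_name,
    parsed_at   = epoch_ms(now());
```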

cpg_git_state

Track last known git commit.

CREATE TABLE cpg_git_state (
    id              INTEGER PRIMARY KEY DEFAULT 1,
    commit_hash     VARCHAR NOT NULL,
    branch          VARCHAR,
    updated_at      BIGINT DEFAULT (epoch_ms(now()))
);

cpg_branch_state

Track CPG state per branch for branch-aware updates.

CREATE TABLE cpg_branch_state (
    branch_name             VARCHAR PRIMARY KEY,
    last_commit_sha         VARCHAR(40) NOT NULL,
    last_commit_timestamp   BIGINT,
    file_count              INTEGER DEFAULT 0,
    node_count              INTEGER DEFAULT 0,
    is_active               BOOLEAN DEFAULT FALSE,
    created_at              BIGINT DEFAULT (epoch_ms(now())),
    updated_at              BIGINT DEFAULT (epoch_ms(now()))
);

cpg_submodule_state

Track git submodule state for submodule-aware updates.

CREATE TABLE cpg_submodule_state (
    submodule_path      VARCHAR(4096) PRIMARY KEY,
    submodule_url       VARCHAR(4096),
    current_commit_sha  VARCHAR(40),
    parent_commit_sha   VARCHAR(40),
    is_initialized      BOOLEAN DEFAULT FALSE,
    is_recursive        BOOLEAN DEFAULT FALSE,
    last_updated        BIGINT DEFAULT (epoch_ms(now()))
);

export_progress

Export progress for resumable imports.

CREATE TABLE export_progress (
    entity_type     VARCHAR PRIMARY KEY,
    total_count     BIGINT,
    exported_count  BIGINT,
    last_offset     BIGINT,
    status          VARCHAR,
    last_updated    BIGINT,
    error_message   VARCHAR
);

Domain and Pattern Tables

cpg_domain_annotations

Structured function-level domain metadata.

CREATE TABLE cpg_domain_annotations (
    id              BIGINT NOT NULL,
    method_id       BIGINT,
    function_name   VARCHAR NOT NULL,
    annotation_type VARCHAR NOT NULL,
    category        VARCHAR,
    value           VARCHAR,
    confidence      FLOAT DEFAULT 1.0,
    source          VARCHAR DEFAULT 'domain_config',
    metadata        JSON
);

cpg_domain_subsystems

Project organizational structure.

CREATE TABLE cpg_domain_subsystems (
    id           BIGINT NOT NULL,
    name         VARCHAR NOT NULL,
    display_name VARCHAR,
    description  VARCHAR,
    patterns     VARCHAR[],
    key_files    VARCHAR[],
    parent_id    BIGINT
);

cpg_domain_subsystem_members

Method to subsystem membership.

CREATE TABLE cpg_domain_subsystem_members (
    method_id    BIGINT NOT NULL,
    subsystem_id BIGINT NOT NULL,
    confidence   FLOAT DEFAULT 1.0,
    source       VARCHAR DEFAULT 'domain_config',
    PRIMARY KEY (method_id, subsystem_id)
);
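
A sketch of how membership might be aggregated per subsystem, using only the columns defined in the two tables above:

```sql
-- Sketch: method count and mean assignment confidence per subsystem.
SELECT
    s.name,
    COUNT(*)            AS method_count,
    AVG(mem.confidence) AS avg_confidence
FROM cpg_domain_subsystem_members mem
JOIN cpg_domain_subsystems s ON s.id = mem.subsystem_id
GROUP BY s.id, s.name
ORDER BY method_count DESC;
```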

finding_evidence

Links findings to CPG nodes with role-based evidence chain.

CREATE TABLE finding_evidence (
    id          BIGINT NOT NULL,
    finding_id  BIGINT NOT NULL,
    node_id     BIGINT,
    role        VARCHAR,
    description VARCHAR,
    filename    VARCHAR,
    line_number INTEGER,
    code        VARCHAR
);

cpg_enrichment_state

Tracks what enrichments have been computed.

CREATE TABLE cpg_enrichment_state (
    enrichment_type VARCHAR NOT NULL,
    source          VARCHAR NOT NULL,
    scope           VARCHAR DEFAULT 'global',
    last_computed   TIMESTAMP,
    version         VARCHAR,
    item_count      INTEGER,
    needs_refresh   BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (enrichment_type, source, scope)
);

cpg_enrichment_anchors

FQN-based stable identity for enrichment tags across updates.

CREATE TABLE cpg_enrichment_anchors (
    tag_id       BIGINT NOT NULL,
    anchor_fqn   VARCHAR NOT NULL,
    anchor_hash  VARCHAR,
    filename     VARCHAR,
    source       VARCHAR,
    PRIMARY KEY (tag_id)
);

cpg_pattern_results

Pattern matching results (CPG-aware structural search).

CREATE TABLE cpg_pattern_results (
    id              BIGINT PRIMARY KEY,
    rule_id         VARCHAR NOT NULL,
    node_id         BIGINT NOT NULL,
    filename        VARCHAR NOT NULL,
    line_number     INTEGER NOT NULL,
    column_number   INTEGER DEFAULT 0,
    code            VARCHAR,
    message         VARCHAR,
    severity        VARCHAR NOT NULL,
    category        VARCHAR,
    confidence      DOUBLE DEFAULT 1.0,
    match_data      VARCHAR,
    cpg_context     VARCHAR,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

cpg_pattern_rules

Pattern matching rules loaded for scan.

CREATE TABLE cpg_pattern_rules (
    rule_id         VARCHAR PRIMARY KEY,
    language        VARCHAR NOT NULL,
    severity        VARCHAR NOT NULL,
    category        VARCHAR,
    has_cpg         BOOLEAN DEFAULT FALSE,
    rule_source     VARCHAR,
    loaded_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

cpg_pattern_deps

Pattern dependency tracking for incremental invalidation.

CREATE TABLE cpg_pattern_deps (
    rule_id       VARCHAR NOT NULL,
    filename      VARCHAR NOT NULL,
    dep_filename  VARCHAR NOT NULL,
    dep_type      VARCHAR NOT NULL,
    PRIMARY KEY (rule_id, filename, dep_filename, dep_type)
);
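
The sketch below shows one way incremental invalidation could work, under the assumption that a row (rule_id, filename, dep_filename) means "results of rule_id in filename depend on dep_filename". The changed path 'src/util.py' is a placeholder.

```sql
-- Sketch: which (rule, file) pairs need a re-scan after one file changed?
SELECT DISTINCT rule_id, filename
FROM cpg_pattern_deps
WHERE dep_filename = 'src/util.py';  -- placeholder: the changed file

-- Sketch: drop the now-stale results for those pairs.
DELETE FROM cpg_pattern_results
USING cpg_pattern_deps d
WHERE d.rule_id  = cpg_pattern_results.rule_id
  AND d.filename = cpg_pattern_results.filename
  AND d.dep_filename = 'src/util.py';
```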

Property Graph Definition

Using duckpgq to create a unified property graph with full CPG schema support.

Key insight: DuckDB PGQ does not support views in VERTEX TABLES, but a materialized table can be used in their place. This allows full support for polymorphic edges.

Step 1: Create Materialized Unified Nodes Table

Important: Use CREATE TABLE (not CREATE VIEW) to materialize the unified node set.

-- Create materialized unified table of all CPG nodes for polymorphic edge support
DROP TABLE IF EXISTS cpg_nodes;

CREATE TABLE cpg_nodes AS
SELECT id, 'FILE' as node_type FROM nodes_file
UNION ALL SELECT id, 'NAMESPACE_BLOCK' FROM nodes_namespace_block
UNION ALL SELECT id, 'METHOD' FROM nodes_method
UNION ALL SELECT id, 'METHOD_REF' FROM nodes_method_ref
UNION ALL SELECT id, 'CALL' FROM nodes_call
UNION ALL SELECT id, 'IDENTIFIER' FROM nodes_identifier
UNION ALL SELECT id, 'FIELD_IDENTIFIER' FROM nodes_field_identifier
UNION ALL SELECT id, 'LITERAL' FROM nodes_literal
UNION ALL SELECT id, 'LOCAL' FROM nodes_local
UNION ALL SELECT id, 'PARAM' FROM nodes_param
UNION ALL SELECT id, 'PARAM_OUT' FROM nodes_method_parameter_out
UNION ALL SELECT id, 'METHOD_RETURN' FROM nodes_method_return
UNION ALL SELECT id, 'RETURN' FROM nodes_return
UNION ALL SELECT id, 'BLOCK' FROM nodes_block
UNION ALL SELECT id, 'CONTROL_STRUCTURE' FROM nodes_control_structure
UNION ALL SELECT id, 'MEMBER' FROM nodes_member
UNION ALL SELECT id, 'TYPE_DECL' FROM nodes_type_decl
UNION ALL SELECT id, 'TYPE_REF' FROM nodes_type_ref
UNION ALL SELECT id, 'TYPE_PARAM' FROM nodes_type_param
UNION ALL SELECT id, 'TYPE_ARGUMENT' FROM nodes_type_argument
UNION ALL SELECT id, 'UNKNOWN' FROM nodes_unknown
UNION ALL SELECT id, 'JUMP_TARGET' FROM nodes_jump_target
UNION ALL SELECT id, 'BINDING' FROM nodes_binding
UNION ALL SELECT id, 'CLOSURE_BINDING' FROM nodes_closure_binding
UNION ALL SELECT id, 'COMMENT' FROM nodes_comment
UNION ALL SELECT id, 'COLLECTION_DECL' FROM nodes_collection_decl;

-- Add primary key and indexes
ALTER TABLE cpg_nodes ADD PRIMARY KEY (id);
CREATE INDEX idx_cpg_nodes_type ON cpg_nodes(node_type);

Step 2: Create Comprehensive Property Graph

Full Implementation (with ALL edge types including polymorphic edges):

CREATE PROPERTY GRAPH cpg
VERTEX TABLES (
    -- Materialized unified node table for polymorphic edge support
    cpg_nodes LABEL CPG_NODE,

    -- Individual typed node tables for specific queries
    nodes_file LABEL FILE_NODE,
    nodes_namespace_block LABEL NAMESPACE_BLOCK,
    nodes_method LABEL METHOD,
    nodes_method_ref LABEL METHOD_REF,
    nodes_call LABEL CALL_NODE,
    nodes_identifier LABEL IDENTIFIER,
    nodes_field_identifier LABEL FIELD_IDENTIFIER,
    nodes_literal LABEL LITERAL,
    nodes_local LABEL LOCAL,
    nodes_param LABEL PARAM,
    nodes_method_parameter_out LABEL PARAM_OUT,
    nodes_method_return LABEL METHOD_RETURN,
    nodes_return LABEL RETURN_NODE,
    nodes_block LABEL BLOCK,
    nodes_control_structure LABEL CONTROL_STRUCTURE,
    nodes_member LABEL MEMBER,
    nodes_type_decl LABEL TYPE_DECL,
    nodes_type_ref LABEL TYPE_REF,
    nodes_type_param LABEL TYPE_PARAM,
    nodes_type_argument LABEL TYPE_ARGUMENT,
    nodes_unknown LABEL UNKNOWN,
    nodes_jump_target LABEL JUMP_TARGET,
    nodes_binding LABEL BINDING,
    nodes_closure_binding LABEL CLOSURE_BINDING,
    nodes_comment LABEL COMMENT_NODE,
    nodes_metadata LABEL METADATA
)
EDGE TABLES (
    -- ========================================
    -- POLYMORPHIC EDGES (via cpg_nodes table)
    -- ========================================

    edges_ast
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL AST,

    edges_cfg
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL CFG,

    edges_ref
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL REF,

    edges_reaching_def
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL REACHING_DEF,

    edges_argument
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL ARGUMENT,

    edges_dominate
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL DOMINATE,

    edges_post_dominate
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL POST_DOMINATE,

    edges_cdg
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL CDG,

    edges_binds
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL BINDS,

    edges_binds_to
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL BINDS_TO,

    -- ========================================
    -- TYPED EDGES (specific source/destination)
    -- ========================================

    edges_call
        SOURCE KEY (src) REFERENCES nodes_call (id)
        DESTINATION KEY (dst) REFERENCES nodes_method (id)
        LABEL CALLS,

    edges_receiver
        SOURCE KEY (src) REFERENCES nodes_call (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL RECEIVER,

    edges_condition
        SOURCE KEY (src) REFERENCES nodes_control_structure (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL CONDITION,

    edges_source_file
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES nodes_file (id)
        LABEL SOURCE_FILE,

    edges_alias_of
        SOURCE KEY (src) REFERENCES nodes_type_decl (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL ALIAS_OF,

    edges_inherits_from
        SOURCE KEY (src) REFERENCES nodes_type_decl (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL INHERITS_FROM,

    edges_capture
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES nodes_closure_binding (id)
        LABEL CAPTURE,

    edges_documents
        SOURCE KEY (src) REFERENCES cpg_nodes (id)
        DESTINATION KEY (dst) REFERENCES cpg_nodes (id)
        LABEL DOCUMENTS
);

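Key Features:
- Polymorphic edges (AST, CFG, REF, REACHING_DEF, ARGUMENT, and others) via the materialized cpg_nodes table
- Typed edges (CALLS, RECEIVER, CONDITION, SOURCE_FILE, ALIAS_OF, INHERITS_FROM, CAPTURE, DOCUMENTS) with specific source/destination tables
- Typed vertices for efficient targeted queries
- DuckDB PGQ compatible (uses tables, not views)
- Targets CPG spec v1.1; edge tables such as edges_contains, edges_eval_type, edges_tagged_by, edges_pdg, and edges_ddg are not registered in the definition above and are queried with plain SQL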
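
The CREATE PROPERTY GRAPH statement above requires the duckpgq extension to be loaded in the session. The following sketch installs it from the DuckDB community extension repository and then runs a small smoke test against the graph.

```sql
-- Load the extension (once per session; INSTALL is only needed the first time).
INSTALL duckpgq FROM community;
LOAD duckpgq;

-- Smoke test: count CALLS edges visible through the property graph.
-- The result should not exceed SELECT COUNT(*) FROM edges_call.
SELECT COUNT(*) AS call_edges
FROM GRAPH_TABLE(cpg
    MATCH (c:CALL_NODE)-[e:CALLS]->(m:METHOD)
    COLUMNS (c.id AS call_id)
);
```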

Example Queries

Standard SQL Query: Find all calls to a specific method

SELECT c.*, m.name, m.filename, m.line_number
FROM nodes_call c
JOIN edges_call ec ON c.id = ec.src
JOIN nodes_method m ON ec.dst = m.id
WHERE m.full_name = 'com.example.MyClass.myMethod:void()';

DuckDB PGQ Query: Find direct call chains (caller -> callee)

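-- CALLS edges run from a call site to the callee method; the caller is
-- resolved through nodes_call.containing_method_id.
SELECT caller.full_name AS caller_name, g.callee_name
FROM GRAPH_TABLE(cpg
    MATCH (call:CALL_NODE)-[e:CALLS]->(callee:METHOD)
    COLUMNS (call.containing_method_id AS caller_id, callee.full_name AS callee_name)
) g
JOIN nodes_method caller ON caller.id = g.caller_id
WHERE caller.name = 'main';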

DuckDB PGQ Query: Find methods and their AST children

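-- AST edges are registered on cpg_nodes, so both endpoints use the CPG_NODE label.
SELECT m.full_name AS method, g.child_id
FROM GRAPH_TABLE(cpg
    MATCH (parent:CPG_NODE)-[e:AST]->(child:CPG_NODE)
    WHERE parent.node_type = 'METHOD'
    COLUMNS (parent.id AS method_id, child.id AS child_id)
) g
JOIN nodes_method m ON m.id = g.method_id;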

DuckDB PGQ Query: Data flow paths using REACHING_DEF

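-- Bounded repetition uses a {min,max} quantifier after the edge pattern.
SELECT *
FROM GRAPH_TABLE(cpg
    MATCH p = ANY SHORTEST (def_node:CPG_NODE)-[r:REACHING_DEF]->{1,5}(use_node:CPG_NODE)
    COLUMNS (def_node.id AS source_id, use_node.id AS sink_id)
)
LIMIT 100;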

DuckDB PGQ Query: CFG paths (control flow)

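-- "end" is a reserved word in SQL, so the vertex variables are named from_node/to_node.
SELECT *
FROM GRAPH_TABLE(cpg
    MATCH p = ANY SHORTEST (from_node:CPG_NODE)-[c:CFG]->{1,3}(to_node:CPG_NODE)
    COLUMNS (from_node.id AS start_node, to_node.id AS end_node)
)
LIMIT 100;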

DuckDB PGQ Query: Find all identifiers and their references

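-- REF edges are registered on cpg_nodes, so the identifier name is joined in afterwards.
SELECT i.name AS identifier_name, g.declaration_id
FROM GRAPH_TABLE(cpg
    MATCH (ref:CPG_NODE)-[e:REF]->(decl:CPG_NODE)
    WHERE ref.node_type = 'IDENTIFIER'
    COLUMNS (ref.id AS identifier_id, decl.id AS declaration_id)
) g
JOIN nodes_identifier i ON i.id = g.identifier_id;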

DuckDB PGQ Query: Type hierarchy (inheritance)

SELECT *
FROM GRAPH_TABLE(cpg
    MATCH (derived:TYPE_DECL)-[:INHERITS_FROM]->(base:CPG_NODE)
    COLUMNS (derived.full_name AS derived_type, base.id AS base_type)
);

Combined Query: Methods with most incoming calls

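-- The pattern variable m is not visible outside GRAPH_TABLE, so the result is aliased as g.
SELECT g.full_name, COUNT(*) AS call_count
FROM GRAPH_TABLE(cpg
    MATCH (c:CALL_NODE)-[e:CALLS]->(m:METHOD)
    COLUMNS (m.full_name AS full_name, m.id AS method_id)
) g
GROUP BY g.full_name
ORDER BY call_count DESC
LIMIT 10;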

Performance Considerations

  1. Batching: Use 10,000 row batches for INSERT operations
  2. Indexing: Create indexes after bulk load for faster imports
  3. Partitioning: Consider partitioning by filename for large codebases
  4. Compression: Use DuckDB’s automatic compression for TEXT columns
  5. Memory: Configure DuckDB memory limit based on CPG size
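
Points 2 and 5 above can be sketched for a single edge table as follows. The memory value and the Parquet file name are placeholders chosen for illustration; the index names are the ones defined for edges_ast earlier in this document.

```sql
-- Sketch: cap memory, load without indexes, then rebuild them.
SET memory_limit = '8GB';                 -- placeholder value; size to the CPG

DROP INDEX IF EXISTS idx_ast_src;
DROP INDEX IF EXISTS idx_ast_dst;

INSERT INTO edges_ast
SELECT src, dst
FROM read_parquet('edges_ast.parquet');   -- hypothetical input file

CREATE INDEX idx_ast_src ON edges_ast(src);
CREATE INDEX idx_ast_dst ON edges_ast(dst);
```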

Schema Version

  • CPG Spec: v1.1
  • DuckDB: 1.4.2+
  • duckpgq: Latest stable
  • Schema Version: 7.0 (Full schema.go alignment)
  • Last Updated: 2026-02-28

Changelog

v7.0 (2026-02-28) - Full schema.go alignment

MAJOR UPDATE: All table definitions now match gocpg/pkg/storage/duckdb/schema.go exactly

New node tables:
- nodes_modifier: Access modifier nodes
- nodes_annotation: Method/class annotations
- nodes_type: Type instances
- nodes_jump_label: Jump labels (goto)
- nodes_tag_v2: External context tags (replaces legacy nodes_tag)
- nodes_finding: Static analysis findings
- nodes_macro: C preprocessor macros
- nodes_macro_param: Macro parameters
- nodes_namespace: Namespace index nodes
- nodes_annotation_literal: Annotation literal values
- nodes_annotation_parameter: Formal annotation parameters
- nodes_annotation_parameter_assign: Annotation parameter assignments
- nodes_key_value_pair: Key-value pairs for findings
- nodes_location: Source code location summary
- nodes_config_file: Configuration file content
- nodes_import: Import/include statements
- nodes_closure_binding: Closure variable capture

New edge tables:
- edges_contains: Structural containment
- edges_eval_type: Type evaluation
- edges_tagged_by: Tag association
- edges_parameter_link: Parameter links
- edges_pdg: Program Dependence Graph
- edges_ddg: Data Dependency Graph
- edges_vtable: Virtual table entries
- edges_documents: Comment documentation links

New views:
- call_containment: Denormalized caller/callee view (11 columns)
- method_docstrings: Comment to method mapping

New state/domain/pattern tables:
- cpg_fqn_index, cpg_file_state, cpg_git_state, cpg_branch_state, cpg_submodule_state, export_progress
- cpg_domain_annotations, cpg_domain_subsystems, cpg_domain_subsystem_members
- finding_evidence, cpg_enrichment_state, cpg_enrichment_anchors
- cpg_pattern_results, cpg_pattern_rules, cpg_pattern_deps

Column changes:
- nodes_method: Added is_nested, loc, parameter_count, embedding*, ast_hash
- nodes_call: Added containing_method_id, callee_method_id, type_origin, type_confidence, embedding*
- nodes_param: Added method_id, parent_param_id, typedef_id, struct_member_id
- nodes_method_return: Added method_id
- nodes_comment: Added hash, line_number_end, column_number_end, containing_method_id, comment_type, documented_node_id, binding_type
- edges_call: Added cross_language, binding_type
- edges_argument: Added argument_index, argument_name
- edges_capture: Added variable_name
- edges_documents: Added binding_type

Table renames:
  • nodes_type_parameter -> nodes_type_param (with new columns: constraint_type, index, method_id, type_decl_id)
  • nodes_param_out -> nodes_method_parameter_out

v6.0 (2026-02-26) - Pre-computed Metrics & Pattern Flags

MAJOR UPDATE: Added pre-computed boolean flags and metrics to nodes_method for fast queries

New nodes_method Columns:
  • has_disabled_code (BOOLEAN) - Detects #if 0, if false, commented-out blocks
  • has_deprecated (BOOLEAN) - Detects @Deprecated, [[deprecated]] annotations
  • has_todo_fixme (BOOLEAN) - Detects TODO, FIXME, HACK, XXX comments
  • has_debug_code (BOOLEAN) - Detects debug prints, console.log, debugging statements
  • is_test (BOOLEAN) - Cross-language test function detection (11 languages)
  • is_entry_point (BOOLEAN) - Public API entry point detection (exported, HTTP handler, main)
  • cyclomatic_complexity (INTEGER) - McCabe cyclomatic complexity metric
  • fan_in (INTEGER) - Number of callers
  • fan_out (INTEGER) - Number of callees

Impact:
  • 10-100x faster pattern queries compared with LIKE scans on code/name columns
  • Cross-language test function detection (pytest, JUnit, Go testing, etc.)
  • Pre-computed metrics enable instant complexity/coupling analysis
  • DuckDB 1.4.2+ required for the DuckPGQ graph query extension
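
As an illustration of how the pre-computed columns replace text scans, the following sketch ranks non-test entry points by complexity using only the v6.0 columns listed above. The thresholds and the LIMIT are arbitrary values chosen for the example, not recommendations from this schema, and the COALESCE calls assume the flags may be NULL for methods that were never analyzed.

```sql
-- Sketch: complex, widely called entry points, without any LIKE scan on code.
-- Threshold 15 and LIMIT 20 are illustrative assumptions.
SELECT full_name,
       cyclomatic_complexity,
       fan_in,
       fan_out
FROM nodes_method
WHERE COALESCE(is_entry_point, FALSE)
  AND NOT COALESCE(is_test, FALSE)
  AND cyclomatic_complexity > 15
ORDER BY cyclomatic_complexity DESC, fan_in DESC
LIMIT 20;

-- Sketch: count methods carrying cleanup markers, per file.
SELECT filename,
       COUNT(*) AS flagged_methods
FROM nodes_method
WHERE COALESCE(has_todo_fixme, FALSE)
   OR COALESCE(has_disabled_code, FALSE)
   OR COALESCE(has_debug_code, FALSE)
GROUP BY filename
ORDER BY flagged_methods DESC;
```

Both queries read only nodes_method, so they need no joins against the AST or comment tables.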

v5.0 (2025-11-16) - Complete Compliance

MAJOR UPDATE: Achieved 100% CPG schema compliance by adding all remaining features

New Node Types:
  • nodes_unknown (UNKNOWN) - Catch-all for unsupported AST constructs
  • nodes_jump_target (JUMP_TARGET) - Labels for goto/break/continue statements
  • nodes_type_parameter (TYPE_PARAMETER) - Generic/template formal parameters
  • nodes_type_argument (TYPE_ARGUMENT) - Generic/template actual arguments
  • nodes_binding (BINDING) - Method resolution bindings
  • nodes_closure_binding (CLOSURE_BINDING) - Variable capture in closures
  • nodes_comment (COMMENT) - Source code comments

New Edge Types:
  • edges_alias_of (ALIAS_OF) - Type alias relationships
  • edges_inherits_from (INHERITS_FROM) - Type inheritance relationships
  • edges_capture (CAPTURE) - Closure variable capture
  • edges_captured_by (CAPTURED_BY) - Reverse closure capture

Impact:
  • 100% CPG schema compliance achieved
  • Complete AST coverage with UNKNOWN nodes
  • Full generic/template support
  • Closure and lambda analysis enabled
  • Type alias and inheritance resolution
  • Comment preservation for documentation
  • Complete program analysis capabilities

Compliance: ~95% → 100% CPG schema compliance

v4.0 (2025-11-16) - Namespace and File Support

MAJOR UPDATE: Added file mapping, namespace support, and method/type references

New Node Types:
  • nodes_file (FILE) - Source file nodes for file-based indexing
  • nodes_namespace_block (NAMESPACE_BLOCK) - Namespace blocks (C++ namespace, Java package)
  • nodes_method_ref (METHOD_REF) - Method references (higher-order functions, function pointers)
  • nodes_type_ref (TYPE_REF) - Type references (reflection, type casting, generics)

New Edge Types:
  • edges_source_file (SOURCE_FILE) - Node to file mapping (auto-created from FILENAME)

Impact:
  • File-based code navigation enabled
  • Namespace-aware analysis supported
  • Higher-order function tracking (callbacks, delegates)
  • Type reflection and generics support
  • Complete source file mapping (IDE integration ready)
  • Cross-file dependency analysis improved

Compliance: ~90% → ~95% CPG schema compliance

v3.0 (2025-11-16) - OOP Support

MAJOR UPDATE: Added OOP analysis support and precise source mapping

New Node Types:
  • nodes_field_identifier (FIELD_IDENTIFIER) - Field access nodes for OOP
  • nodes_member (MEMBER) - Class/struct field declarations

New Edge Types:
  • edges_binds (BINDS) - Name binding edges (import/using statements)
  • edges_binds_to (BINDS_TO) - Reverse binding edges (name resolution)

New Properties:
  • OFFSET, OFFSET_END: Precise byte-level source mapping (added to METHOD, TYPE_DECL, IDENTIFIER, FIELD_IDENTIFIER, MEMBER)
  • MODIFIER: Access modifiers array (added to METHOD, TYPE_DECL)
  • CANONICAL_NAME: Normalized identifier name (added to FIELD_IDENTIFIER)

Impact:
  • OOP code analysis fully supported (field access tracking)
  • Precise source code location mapping (byte-level)
  • Visibility analysis enabled (PUBLIC, PRIVATE, STATIC, etc.)
  • Variable/function name resolution improved
  • Alias analysis enabled (canonical names)

Compliance: ~80% → ~90% CPG schema compliance

v2.0 (2025-11-16) - Critical Updates

MAJOR UPDATE: Added critical components for PDG and SSA analysis

New Node Types:
  • nodes_param_out (METHOD_PARAMETER_OUT) - Output parameters for SSA analysis
  • nodes_method_return (METHOD_RETURN) - Formal return parameter

New Edge Types:
  • edges_cdg (Control Dependence Graph) - Critical for PDG!

Impact:
  • PDG now complete (DDG + CDG)
  • SSA analysis now possible
  • Program slicing enabled
  • Security taint analysis improved

Compliance: ~70% → ~80% CPG schema compliance

v1.0 (2025-11-15) - Initial Release

  • 11 node types (METHOD, CALL, IDENTIFIER, LITERAL, LOCAL, PARAM, RETURN, BLOCK, CONTROL_STRUCTURE, TYPE_DECL, METADATA)
  • 10 edge types (AST, CFG, CALL, REF, REACHING_DEF, ARGUMENT, RECEIVER, CONDITION, DOMINATE, POST_DOMINATE)

Extension: Semantic Tag System

Overview

The tag system extends the CPG with semantic annotations for methods and other code elements. Tags are custom extensions (not part of CPG spec v1.1) that enable semantic search, code classification, and intelligent analysis.

nodes_tag_v2

Stores tag definitions (semantic labels for code elements) together with external-source tracking and a confidence score.

CREATE TABLE nodes_tag_v2 (
    id              BIGINT PRIMARY KEY,
    name            VARCHAR NOT NULL,
    value           VARCHAR,
    external_source VARCHAR,
    external_id     VARCHAR,
    external_url    VARCHAR,
    confidence      DOUBLE DEFAULT 1.0,
    metadata        VARCHAR,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_tag_v2_name ON nodes_tag_v2(name);
CREATE INDEX idx_tag_v2_name_value ON nodes_tag_v2(name, value);
CREATE INDEX idx_tag_external ON nodes_tag_v2(external_source, external_id);

Properties:
  • NAME: Tag category identifier (unique semantic dimension)
  • VALUE: Tag value within that category
  • EXTERNAL_SOURCE: Origin system for the tag (e.g., ‘jira’, ‘sonarqube’)
  • EXTERNAL_ID: Identifier in the external system
  • EXTERNAL_URL: Link to the external resource
  • CONFIDENCE: Confidence score for the tag assignment (0.0-1.0, default 1.0)
  • METADATA: Additional JSON metadata for the tag
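
Because METADATA is stored as a VARCHAR holding JSON, individual fields can be read with DuckDB's JSON functions. The sketch below is a minimal illustration under stated assumptions: the 'reviewer' key is hypothetical (the schema does not define any metadata keys), the 0.8 confidence cut-off is arbitrary, and the JSON extension is assumed to be available.

```sql
-- Sketch: high-confidence tags imported from Jira, with one metadata field.
-- '$.reviewer' is a hypothetical key used only for illustration.
SELECT t.name,
       t.value,
       t.confidence,
       t.external_id,
       CASE WHEN json_valid(t.metadata)
            THEN json_extract_string(t.metadata, '$.reviewer')
       END AS reviewer
FROM nodes_tag_v2 t
WHERE t.external_source = 'jira'
  AND t.confidence >= 0.8
ORDER BY t.confidence DESC;
```

The CASE guard returns NULL for rows whose metadata is missing or is not valid JSON, rather than failing the whole query.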

edges_tagged_by (Extension)

Connects code elements to their semantic tags.

CREATE TABLE edges_tagged_by (
    src BIGINT NOT NULL,       -- Source node id (typically METHOD)
    dst BIGINT NOT NULL        -- Tag node id (nodes_tag_v2)
);

CREATE INDEX idx_tagged_by_src ON edges_tagged_by(src);
CREATE INDEX idx_tagged_by_dst ON edges_tagged_by(dst);

Relationship:
  • Source: Any CPG node (typically nodes_method)
  • Destination: nodes_tag_v2 entry
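
Attaching a tag therefore takes two writes: one row in nodes_tag_v2 and one edge in edges_tagged_by. The following sketch shows the shape of those writes; the tag id 9000001, the external id 'EXAMPLE-1', and the method name are all hypothetical values, and how ids are actually allocated is not specified here.

```sql
-- Sketch: create a tag and attach it to one method.
-- 9000001, 'EXAMPLE-1', and the method full_name are hypothetical.
INSERT INTO nodes_tag_v2 (id, name, value, external_source, external_id, confidence)
VALUES (9000001, 'security-risk', 'high', 'sonarqube', 'EXAMPLE-1', 0.8);

INSERT INTO edges_tagged_by (src, dst)
SELECT m.id, 9000001
FROM nodes_method m
WHERE m.full_name = 'example.pkg.handle_request';
```

Columns omitted from the first INSERT (created_at, updated_at) fall back to the defaults declared in the table definition.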

Tag Categories

| Category | Description | Example Values |
|---|---|---|
| subsystem-name | Code organizational unit | ‘executor’, ‘planner’, ‘parser’, ‘storage’ |
| security-risk | Security classification | ‘critical’, ‘high’, ‘medium’, ‘low’ |
| taint-source | Untrusted data entry point | Boolean tagging |
| taint-sink | Security-sensitive output | Boolean tagging |
| perf-hotspot | Performance critical code | Boolean tagging |
| allocation-heavy | Memory-intensive methods | Boolean tagging |
| test-coverage | Has associated tests | Boolean tagging |
| cyclomatic-complexity | Code complexity metric | Numeric values (e.g., ‘15’, ‘25’) |
| function-purpose | Semantic description | Free-form text description |
| entry-point | System entry point | Boolean tagging |
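
For the Boolean categories, a query only needs to test whether a tag of that name is attached. The sketch below assumes that the mere presence of the tag means "true", whatever its value column holds; that interpretation is an assumption and should be checked against the actual data.

```sql
-- Sketch: taint sinks that have no test-coverage tag.
-- Assumes presence of a Boolean tag means "true" (not confirmed by the schema).
SELECT m.full_name, m.filename
FROM nodes_method m
WHERE EXISTS (
        SELECT 1
        FROM edges_tagged_by e
        JOIN nodes_tag_v2 t ON e.dst = t.id
        WHERE e.src = m.id
          AND t.name = 'taint-sink')
  AND NOT EXISTS (
        SELECT 1
        FROM edges_tagged_by e
        JOIN nodes_tag_v2 t ON e.dst = t.id
        WHERE e.src = m.id
          AND t.name = 'test-coverage')
ORDER BY m.full_name;
```

Using EXISTS instead of a join keeps one row per method even when the same tag is attached more than once.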

Example Tag Queries

-- Find all security-critical methods
SELECT m.*
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'security-risk' AND t.value = 'critical';

-- List all subsystems
SELECT DISTINCT t.value as subsystem
FROM nodes_tag_v2 t
WHERE t.name = 'subsystem-name'
ORDER BY t.value;

-- Find methods in a specific subsystem
SELECT m.full_name, m.filename, m.line_number
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'subsystem-name' AND t.value = 'executor';

-- Get complexity hotspots
-- TRY_CAST yields NULL for non-numeric values instead of raising an error
SELECT m.full_name, t.value as complexity
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'cyclomatic-complexity'
  AND TRY_CAST(t.value AS INTEGER) > 20
ORDER BY TRY_CAST(t.value AS INTEGER) DESC;

-- Count tag assignments by category
SELECT t.name, COUNT(*) as assignment_count
FROM nodes_tag_v2 t
JOIN edges_tagged_by e ON t.id = e.dst
GROUP BY t.name
ORDER BY assignment_count DESC;

Tag Statistics

Current database contains approximately:
  • 15.68M tags across 98 categories
  • Primary categories: subsystem-name, security-risk, function-purpose
  • Tags enable semantic code search and intelligent analysis

Integration Notes

Tags are enriched post-CPG generation through:
  1. Static analysis: CPG tag queries (cpg.method.tag.name(...))
  2. LLM enrichment: AI-generated function purpose descriptions
  3. Computed metrics: Cyclomatic complexity from CFG analysis
  4. Manual annotation: Security audit findings
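
Since tags arrive through several enrichment paths with differing reliability, it can help to audit them by provenance. The sketch below groups tags by external_source and reports average confidence; it assumes each enrichment path records a distinguishable external_source value, which this document does not guarantee. The 0.5 review threshold is an arbitrary example.

```sql
-- Sketch: tag volume and average confidence per origin.
-- Assumes external_source distinguishes the enrichment paths.
SELECT COALESCE(t.external_source, '(unspecified)') AS source,
       COUNT(*)                                     AS tag_count,
       AVG(t.confidence)                            AS avg_confidence
FROM nodes_tag_v2 t
GROUP BY 1
ORDER BY tag_count DESC;

-- Sketch: low-confidence function-purpose descriptions queued for review.
SELECT m.full_name, t.value AS purpose, t.confidence
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'function-purpose'
  AND t.confidence < 0.5
ORDER BY t.confidence;
```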

Tags extend the CPG without modifying the core schema, maintaining compatibility with standard CPG tools while enabling advanced semantic analysis.


Last updated: 2026-02-28