Structural Pattern Search

Structural Pattern Search lets developers and security engineers find code that matches structural patterns under CPG-aware constraints. It is powered by GoCPG’s AST-based engine.

Quick Start

/select pattern_search

Or via CLI:

# Ad-hoc pattern search
python -m src.cli patterns search 'malloc($x)' --lang c

# Scan with all rules
python -m src.cli patterns scan

# Scan specific rule
python -m src.cli patterns scan --rule unchecked-return

How It Works

Architecture

Structural Pattern Search is a standalone tool (not a LangGraph scenario) accessible via CLI, REST API, and MCP. The backend is GoCPG’s pattern engine, communicating over gRPC:

CLI / REST API / MCP
        |
        v
  GoCPGClient (.scan(), .search(), .validate_rule())
        |  gRPC
        v
  gocpg binary (AST matching + CPG constraints)
        |
        v
  DuckDB (cpg_pattern_results, cpg_pattern_rules)
        |
        v
  PatternQueriesMixin (Python reads persisted results)
        |
        +---> PatternTaintBridge (enrich with taint paths)
        +---> SSRAutofixBridge (generate security fixes)

GoCPGClient.scan() runs YAML rules against the CPG database. GoCPGClient.search() performs ad-hoc AST pattern matching. Results are persisted in cpg_pattern_results and queried by PatternQueriesMixin in src/services/cpg/pattern_queries.py.

Pattern Types

Syntactic patterns — tree-sitter CST patterns with metavariables:

Metavariable  Matches
$VAR          Any single expression or identifier
$$ARGS        Zero or more arguments
$_            Any node (wildcard)

# Find malloc calls (single quotes keep the shell from expanding $x)
python -m src.cli patterns search 'malloc($x)' --lang c

# Find if-return without else
python -m src.cli patterns search 'if ($cond) { return $val; }' --lang c
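
To make the metavariable semantics concrete, here is a toy matcher over a flat argument list. It is only an illustration of what the three metavariable kinds accept; GoCPG matches against syntax trees, and the function name `match_args` and the list-of-strings representation are assumptions made for this sketch.

```python
from typing import Dict, List, Optional

Bindings = Dict[str, object]


def match_args(pattern: List[str], args: List[str],
               bindings: Optional[Bindings] = None) -> Optional[Bindings]:
    """Match pattern items against argument strings.

    Returns the metavariable bindings on success, or None on failure.
    """
    bindings = dict(bindings or {})
    if not pattern:
        # The pattern is exhausted: succeed only if the arguments are too.
        return bindings if not args else None

    head, rest = pattern[0], pattern[1:]

    if head.startswith("$$"):
        # $$ARGS: zero or more arguments. Try the shortest split first.
        for n in range(len(args) + 1):
            trial = dict(bindings)
            trial[head] = args[:n]
            result = match_args(rest, args[n:], trial)
            if result is not None:
                return result
        return None

    if not args:
        return None

    if head == "$_":
        # Wildcard: consume one argument without binding it.
        return match_args(rest, args[1:], bindings)

    if head.startswith("$"):
        # $VAR: bind exactly one argument.
        bindings[head] = args[0]
        return match_args(rest, args[1:], bindings)

    # Anything else is a literal that must match exactly.
    return match_args(rest, args[1:], bindings) if head == args[0] else None


print(match_args(["$x"], ["n * 4"]))
print(match_args(["fmt", "$$ARGS"], ["fmt"]))
```

The second call shows the difference between the two named forms: `$$ARGS` succeeds with an empty binding, whereas `$x` would fail on an empty argument list.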

CPG-constrained patterns — patterns with data flow, call graph, type, and domain constraints:

id: unchecked-return
pattern: "$ret = $func($$args)"
language: c
constraints:
  - type: data_flow
    from: "$ret"
    not_reaches: "if ($ret"
  - type: call_graph
    callee: "$func"
    returns: "int"
message: "Return value of $func is not checked"
severity: warning
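
The `not_reaches` constraint in the rule above can be read as a reachability question: the match is reported only if the value bound to `$ret` never flows into an `if` check. The following toy sketch illustrates that reading on a hand-written flow graph. The graph representation (code strings as nodes) and the helper names `reaches` and `unchecked` are assumptions for illustration; GoCPG evaluates the constraint on the real CPG.

```python
from collections import deque
from typing import Dict, List

FlowGraph = Dict[str, List[str]]


def reaches(flow: FlowGraph, start: str, target_prefix: str) -> bool:
    """Return True if any node reachable from `start` begins with `target_prefix`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in flow.get(node, []):
            if successor.startswith(target_prefix):
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


def unchecked(flow: FlowGraph, ret_node: str) -> bool:
    """Mirror `not_reaches`: report only when no if-check is reachable."""
    return not reaches(flow, ret_node, "if (")


# Hypothetical data-flow edges from an assignment to its later uses.
checked_flow = {"rc = write(fd, buf, n)": ["if (rc < 0)"]}
ignored_flow = {"rc = write(fd, buf, n)": ["log(rc)"], "log(rc)": []}

print(unchecked(checked_flow, "rc = write(fd, buf, n)"))
print(unchecked(ignored_flow, "rc = write(fd, buf, n)"))
```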

YAML Rules

Pre-defined rules live in configs/rules/: 190 rules across 14 languages. Pass --domain-rules to also auto-load the domain-specific rules of the active domain plugin.

CLI Usage

python -m src.cli patterns provides 6 subcommands (scan, search, fix, list, stats, generate):

# Scan with all rules
python -m src.cli patterns scan --db data/projects/test.duckdb

# Scan with severity filter and SARIF output
python -m src.cli patterns scan --severity error --format sarif --output results.sarif

# Scan with domain-specific rules auto-loaded
python -m src.cli patterns scan --domain-rules

# Scan with incremental evaluation
python -m src.cli patterns scan --incremental

# Scan a specific rule
python -m src.cli patterns scan --rule unchecked-return

# Ad-hoc pattern search
python -m src.cli patterns search 'malloc($SIZE)' --lang c --max-results 50

# Apply fixes (dry run)
python -m src.cli patterns fix --dry-run

# Apply fixes (with approval)
python -m src.cli patterns fix --rule unchecked-return

# List loaded pattern rules
python -m src.cli patterns list

# Show pattern statistics
python -m src.cli patterns stats

# Generate rule from natural language
python -m src.cli patterns generate "find unchecked return values" --lang c --output rule.yaml

Output formats: text (default), json, sarif. The fix subcommand uses ApprovalEngine for interactive approval before applying changes.
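
When the CLI is driven from a script, building the argument list in Python avoids shell quoting problems. The sketch below uses only the flags shown in the examples above. The helper names are hypothetical, and whether every combination of flags is accepted together is an assumption.

```python
from typing import List, Optional

BASE = ["python", "-m", "src.cli", "patterns"]


def build_scan_command(db: Optional[str] = None, rule: Optional[str] = None,
                       severity: Optional[str] = None,
                       output_format: Optional[str] = None,
                       output: Optional[str] = None,
                       incremental: bool = False,
                       domain_rules: bool = False) -> List[str]:
    """Assemble argv for `patterns scan` from the documented flags."""
    cmd = BASE + ["scan"]
    if db:
        cmd += ["--db", db]
    if rule:
        cmd += ["--rule", rule]
    if severity:
        cmd += ["--severity", severity]
    if output_format:
        cmd += ["--format", output_format]
    if output:
        cmd += ["--output", output]
    if incremental:
        cmd.append("--incremental")
    if domain_rules:
        cmd.append("--domain-rules")
    return cmd


def build_search_command(pattern: str, lang: str,
                         max_results: Optional[int] = None) -> List[str]:
    """Assemble argv for `patterns search`."""
    cmd = BASE + ["search", pattern, "--lang", lang]
    if max_results is not None:
        cmd += ["--max-results", str(max_results)]
    return cmd


print(build_scan_command(severity="error", output_format="sarif",
                         output="results.sarif"))
print(build_search_command("malloc($SIZE)", "c", max_results=50))
```

Passing such a list to `subprocess.run(cmd, check=True)` bypasses the shell, so metavariables like `$SIZE` reach the CLI unmodified.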

REST API

6 endpoints mounted at /api/v1/patterns/:

Method  Endpoint                   Description
POST    /api/v1/patterns/search    Ad-hoc structural pattern search
GET     /api/v1/patterns/findings  Query persisted pattern findings (filters: rule_id, severity, filename, category)
GET     /api/v1/patterns/stats     Aggregated statistics by severity, category, rule
GET     /api/v1/patterns/rules     List loaded pattern rules from cpg_pattern_rules
POST    /api/v1/patterns/generate  LLM-generate a YAML rule from description
POST    /api/v1/patterns/fix       Apply SSR fixes (dry_run=true by default, approval required)

Example:

# Search patterns
curl -X POST http://localhost:8000/api/v1/patterns/search \
  -H "Content-Type: application/json" \
  -d '{"pattern": "malloc($x)", "language": "c", "max_results": 50}'

# Get findings for a rule
curl "http://localhost:8000/api/v1/patterns/findings?rule_id=unchecked-return"

# Get statistics
curl http://localhost:8000/api/v1/patterns/stats

# List rules
curl http://localhost:8000/api/v1/patterns/rules

# Generate rule
curl -X POST http://localhost:8000/api/v1/patterns/generate \
  -H "Content-Type: application/json" \
  -d '{"description": "find unchecked return values", "language": "c"}'

# Apply fix (dry run)
curl -X POST http://localhost:8000/api/v1/patterns/fix \
  -H "Content-Type: application/json" \
  -d '{"rule_id": "unchecked-return", "dry_run": true}'
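
The same search call can be made from Python with the standard library. This is a minimal sketch that mirrors the first curl example. The helper names are hypothetical, and the shape of the JSON response is not documented here, so the sketch simply returns whatever the server sends.

```python
import json
import urllib.request

# Host and port are taken from the curl examples above.
BASE_URL = "http://localhost:8000/api/v1/patterns"


def build_search_request(pattern: str, language: str,
                         max_results: int = 50) -> urllib.request.Request:
    """Build the POST request for the ad-hoc search endpoint."""
    body = json.dumps({
        "pattern": pattern,
        "language": language,
        "max_results": max_results,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("malloc($x)", "c")
print(request.get_method(), request.full_url)
# result = send(request)  # uncomment when the API server is running
```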

MCP Tools

6 tools registered in src/mcp/tools/patterns.py:

Tool                        Parameters                                       Description
codegraph_pattern_search    pattern, language, max_results                   AST-based structural search
codegraph_pattern_findings  rule_id?, severity?, filename?, category?, limit  Query persisted findings
codegraph_pattern_stats     (none)                                           Aggregated statistics
codegraph_pattern_fix       rule_id?, dry_run                                Apply SSR fixes
codegraph_pattern_generate  description, language, with_fix                  LLM-generate YAML rule
codegraph_pattern_test      rule_yaml, code_snippet, language                Test rule against code snippet
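
MCP clients normally issue these calls for you, but it can help to see what goes over the wire. The sketch below builds a JSON-RPC `tools/call` message as defined by the MCP specification, using the tool and parameter names from the table. The helper `tool_call` is hypothetical, and transport details (stdio or HTTP) are left out.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call(name: str, **arguments) -> str:
    """Serialize an MCP `tools/call` request for the given tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


print(tool_call("codegraph_pattern_search",
                pattern="malloc($x)", language="c", max_results=50))
print(tool_call("codegraph_pattern_fix",
                rule_id="unchecked-return", dry_run=True))
```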

Data Models

Key models from src/services/gocpg/models.py:

Model              Key Fields
ScanConfig         rule_dirs, rule_id, severity_filter, incremental, fix, dry_run, output_format, output_path, domain_rules
GoCPGScanResult    findings, rules_evaluated, files_scanned, total_matches, duration_ms, incremental, sarif_path
GoCPGSearchResult  matches, pattern, language, files_searched
GeneratedRule      yaml_text, rule_id, language, has_fix, validated, validation_errors, generation_attempts
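
For orientation, two of these models might look roughly as follows when written as dataclasses. Only the field names come from the table; the field types, the defaults, and the use of dataclasses are assumptions, so treat src/services/gocpg/models.py as the authority.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class GoCPGSearchResult:
    """Result of an ad-hoc search (field types are assumed)."""
    pattern: str
    language: str
    matches: List[Dict[str, Any]] = field(default_factory=list)
    files_searched: int = 0


@dataclass
class GeneratedRule:
    """An LLM-generated rule plus its validation status (types are assumed)."""
    yaml_text: str
    rule_id: str
    language: str
    has_fix: bool = False
    validated: bool = False
    validation_errors: List[str] = field(default_factory=list)
    generation_attempts: int = 0


rule = GeneratedRule(yaml_text="id: unchecked-return", rule_id="unchecked-return",
                     language="c")
print(rule.validated, rule.validation_errors)
```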

LLM Rule Generation

LLMPatternGenerator in src/analysis/patterns/llm_pattern_generator.py generates YAML rules from natural language descriptions:

  1. Builds a structured prompt from config/prompts/patterns/generate_pattern.yaml
  2. Calls the configured LLM provider
  3. Parses YAML from the response
  4. Validates via gocpg validate-rule (up to 3 retry attempts on failure)
  5. Returns GeneratedRule with validation status

# CLI
python -m src.cli patterns generate "find SQL queries built with string concatenation" --lang python --output rule.yaml

# MCP
codegraph_pattern_generate(description="find unchecked malloc calls", language="c", with_fix=true)

The generate_rule() method is async. The with_fix parameter controls whether a fix: template is included in the generated rule.
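
The generate, validate, and retry steps listed above can be sketched as a small async loop. This is not the code of LLMPatternGenerator: `ask_llm` and `validate` are hypothetical injected callables, feeding validation errors back into the next prompt is an assumption, and "up to 3 retry attempts" is interpreted here as three attempts in total.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

MAX_ATTEMPTS = 3  # assumption: three attempts in total


async def generate_with_retries(
    description: str,
    ask_llm: Callable[[str], Awaitable[str]],
    validate: Callable[[str], List[str]],
) -> Tuple[str, bool, List[str], int]:
    """Return (yaml_text, validated, validation_errors, attempts)."""
    prompt = description
    yaml_text = ""
    errors: List[str] = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        yaml_text = await ask_llm(prompt)
        errors = validate(yaml_text)
        if not errors:
            return yaml_text, True, [], attempt
        # Assumed strategy: show the model what failed so it can correct it.
        prompt = (f"{description}\n\nPrevious attempt failed validation:\n"
                  + "\n".join(errors))
    return yaml_text, False, errors, MAX_ATTEMPTS


# Stand-ins for the LLM provider and for `gocpg validate-rule`.
async def demo_llm(prompt: str) -> str:
    return "id: demo-rule" if "failed validation" in prompt else "not a rule"


def demo_validate(text: str) -> List[str]:
    return [] if text.startswith("id:") else ["missing required field: id"]


print(asyncio.run(generate_with_retries("find unchecked return values",
                                        demo_llm, demo_validate)))
```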

Security Integration

PatternTaintBridge

PatternTaintBridge in src/analysis/patterns/taint_bridge.py enriches structural pattern findings with taint analysis data from S02 (Security Audit):

  • Constructor: PatternTaintBridge(cpg_service, taint_propagator=None)
  • Method: enrich_findings_with_taint(findings) (async)
  • For findings with has_cpg=True, queries CPG for taint paths flowing through the matched node
  • Adds taint_paths and taint_enriched keys to each finding dict

This bridges the gap between structural pattern detection and security vulnerability analysis — a pattern finding at a specific code location can be cross-referenced with taint flow data to determine if untrusted data reaches that location.
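
The enrichment step described above amounts to a filtered async loop over the findings. The sketch below is a simplified stand-in, not PatternTaintBridge itself: `lookup_paths` is a hypothetical callable that replaces the CPG query, and the values written for findings without CPG data are an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Finding = Dict[str, Any]


async def enrich_with_taint(
    findings: List[Finding],
    lookup_paths: Callable[[Finding], Awaitable[List[Any]]],
) -> List[Finding]:
    """Add `taint_paths` and `taint_enriched` keys to every finding dict."""
    for finding in findings:
        if finding.get("has_cpg"):
            finding["taint_paths"] = await lookup_paths(finding)
            finding["taint_enriched"] = True
        else:
            # Assumption: findings without CPG data get empty, unenriched values.
            finding["taint_paths"] = []
            finding["taint_enriched"] = False
    return findings


# Stand-in for the CPG taint query; the path format is made up.
async def demo_lookup(finding: Finding) -> List[Any]:
    return [["argv", "buf", finding["code"]]]


demo_findings = [
    {"rule_id": "autofix-strcpy", "code": "strcpy(dst, buf)", "has_cpg": True},
    {"rule_id": "unchecked-return", "code": "rc = f()", "has_cpg": False},
]
print(asyncio.run(enrich_with_taint(demo_findings, demo_lookup)))
```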

SSR Autofix Bridge

SSRAutofixBridge in src/analysis/autofix/ssr_bridge.py connects pattern engine SSR rules with the AutofixEngine described in Security Audit: Autofix:

  • Maps vulnerability types to SSR rule IDs in configs/rules/autofix/
  • Runs gocpg scan --fix --dry-run per file batch to get AST-aware fix previews
  • Converts results to FixSuggestion objects for the autofix pipeline

Vulnerability type mapping (excerpt):

Vulnerability      SSR Rules
sql_injection      autofix-sprintf-snprintf, autofix-py-format-sql, autofix-go-sprintf-sql, …
buffer_overflow    autofix-sprintf-snprintf, autofix-strcpy, autofix-py-ctypes, autofix-go-cgo-strcpy
null_dereference   autofix-null-assert
command_injection  autofix-py-subprocess, autofix-go-exec

The flow: Pattern scan detects code issues → SSRAutofixBridge maps to fix rules → AutofixEngine generates patches → DiffValidator verifies. SSR fixes have the highest confidence (0.8-1.0) in the autofix pipeline. See Security Audit: Autofix for the full autofix architecture.
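
The mapping step is essentially a dictionary lookup. The sketch below encodes the excerpt of the vulnerability table as data. The names `SSR_RULES` and `rules_for` are hypothetical, and the `sql_injection` entry is incomplete because the source table truncates it.

```python
from typing import Dict, List

SSR_RULES: Dict[str, List[str]] = {
    # The source table ends this entry with an ellipsis; only the listed IDs appear.
    "sql_injection": [
        "autofix-sprintf-snprintf",
        "autofix-py-format-sql",
        "autofix-go-sprintf-sql",
    ],
    "buffer_overflow": [
        "autofix-sprintf-snprintf",
        "autofix-strcpy",
        "autofix-py-ctypes",
        "autofix-go-cgo-strcpy",
    ],
    "null_dereference": ["autofix-null-assert"],
    "command_injection": ["autofix-py-subprocess", "autofix-go-exec"],
}


def rules_for(vulnerability_type: str) -> List[str]:
    """Return the SSR rule IDs mapped to a vulnerability type (empty if unmapped)."""
    return list(SSR_RULES.get(vulnerability_type, []))


print(rules_for("buffer_overflow"))
print(rules_for("unknown_type"))
```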

DB Tables

GoCPG persists scan results in DuckDB:

  • cpg_pattern_results — findings: id, rule_id, severity, category, filename, line_number, column_number, code, message, confidence, match_data, cpg_context
  • cpg_pattern_rules — loaded rules: rule_id, language, severity, category, has_cpg, rule_source

Python reads these via PatternQueriesMixin methods: get_pattern_findings(), get_pattern_rules(), get_pattern_statistics().
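
To show the kind of query these tables support, the sketch below recreates cpg_pattern_results in an in-memory database and runs two lookups. Several things here are assumptions: `sqlite3` stands in for DuckDB so that the example runs with the standard library alone, the column types are guessed from the column names, the sample rows are made up, and the SQL is not the code of PatternQueriesMixin.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cpg_pattern_results (
        id INTEGER PRIMARY KEY, rule_id TEXT, severity TEXT, category TEXT,
        filename TEXT, line_number INTEGER, column_number INTEGER, code TEXT,
        message TEXT, confidence REAL, match_data TEXT, cpg_context TEXT
    )
""")

# Made-up sample findings, for illustration only.
sample_rows = [
    ("unchecked-return", "warning", "reliability", "src/io.c", 42, 5,
     "rc = write(fd, buf, n)", "Return value of write is not checked", 0.9, "{}", "{}"),
    ("unchecked-return", "warning", "reliability", "src/net.c", 17, 9,
     "n = recv(s, buf, len, 0)", "Return value of recv is not checked", 0.9, "{}", "{}"),
    ("autofix-strcpy", "error", "security", "src/io.c", 88, 3,
     "strcpy(dst, src)", "Use a bounded copy", 0.8, "{}", "{}"),
]
conn.executemany(
    "INSERT INTO cpg_pattern_results (rule_id, severity, category, filename, "
    "line_number, column_number, code, message, confidence, match_data, cpg_context) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)


def findings_for_rule(connection, rule_id):
    """Return (filename, line_number, message) for one rule, ordered by location."""
    cursor = connection.execute(
        "SELECT filename, line_number, message FROM cpg_pattern_results "
        "WHERE rule_id = ? ORDER BY filename, line_number",
        (rule_id,),
    )
    return cursor.fetchall()


def counts_by_severity(connection):
    """Return a {severity: count} summary over all findings."""
    cursor = connection.execute(
        "SELECT severity, COUNT(*) FROM cpg_pattern_results GROUP BY severity"
    )
    return dict(cursor.fetchall())


print(findings_for_rule(conn, "unchecked-return"))
print(counts_by_severity(conn))
```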

Example Questions

Find unchecked return values
Find malloc without free
Show functions matching error-handling anti-patterns
Find SQL query construction without parameterization
Find all functions with cyclomatic complexity > 20