For Security Architects, AppSec Engineers, and CISO Teams
Table of Contents
- Abstract
- 1. The Problem
- 1.1 Limitations of Traditional SAST
- 1.2 Why Pattern Matching Is Not Enough
- 2. Methodology
- 2.1 Multi-Criteria Approach
- 2.2 Hypothesis-Driven vs. Pattern-Driven
- 2.3 Cartesian Product Strategy
- 3. Architecture Overview
- 3.1 Pipeline
- 3.2 Domain Providers
- 3.3 Scoring Presets
- 4. V2 Capabilities
- 4.1 Evidence Qualification
- 4.2 Vulnerability Chain Detection
- 4.3 Analyst Feedback Loop
- 4.4 Incremental Analysis
- 4.5 Trend Monitoring
- 5. Deployment Patterns
- 5.1 CLI-Driven Analysis
- 5.2 CI/CD Pipeline Integration
- 5.3 REST API and MCP Integration
- 6. Validation Results
- 6.1 Comparison with Traditional SAST
- 6.2 Taint Verification and Path Feasibility
- 7. Conclusion and ROI
- Related Documents
Abstract
Traditional SAST tools suffer from false positive rates of 70-90%, making triage a dominant cost in application security programs. CodeGraph addresses this with a multi-criteria hypothesis validation system that:
- Generates testable hypotheses from CWE/CAPEC knowledge bases
- Scores hypotheses across three weighted criteria using codebase context
- Verifies findings through taint analysis on Code Property Graph (CPG)
- Incorporates analyst feedback to suppress confirmed false positives
The result: 100% CVE detection on target codebases with false positive rates reduced from 70-90% to under 15% — a 5-6x improvement in analyst productivity.
For API reference, data models, and code examples, see Hypothesis System Reference.
1. The Problem
1.1 Limitations of Traditional SAST
Traditional SAST:
Pattern: "strcpy" found
Result: POSSIBLE vulnerability
False Positive Rate: 70-90%
CodeGraph:
Hypothesis: Untrusted data flows from recv() to strcpy()
Evidence: Taint path verified via CPG
Result: CONFIRMED vulnerability
False Positive Rate: <15%
The cost of false positives is not merely inconvenience — it is the primary reason security findings get ignored. A team reviewing 100 findings where 80 are false positives quickly learns to dismiss all findings, including the real vulnerabilities.
1.2 Why Pattern Matching Is Not Enough
| Problem | Description | Impact |
|---|---|---|
| No context | strcpy is safe if source is a constant | FP on safe code |
| No data flow | Doesn’t consider where data originates | Misses indirect paths |
| No sanitization | Ignores validator/sanitizer functions | FP on protected code |
| No prioritization | All findings have equal weight | Alert fatigue |
| No learning | Dismissed FPs reappear next scan | Wasted analyst time |
2. Methodology
2.1 Multi-Criteria Approach
Traditional vulnerability scoring uses a single dimension (typically CVSS base score). CodeGraph’s hypothesis system evaluates three independent criteria:
| Criterion | Weight | Rationale |
|---|---|---|
| CWE Frequency | 0.40 | How often this weakness appears in real-world CVEs. High-prevalence CWEs are more likely to be present and exploitable. |
| Attack Similarity | 0.30 | How well the hypothesis matches known attack patterns from CAPEC. Considers attack likelihood and required skill level. |
| Codebase Exposure | 0.30 | How exposed the specific codebase is — presence of dangerous sinks, untrusted sources, and absence of sanitizers. |
The 40/30/30 split reflects empirical observation: CWE prevalence is the strongest predictor of real findings, but codebase context and attack feasibility provide essential disambiguation.
Bonus multipliers further adjust priority: known CVE patterns (+20%), critical severity (+10%), recent exploitation in the wild (+15%). These bonuses compound — a hypothesis matching a recently exploited critical CVE pattern receives up to 1.52x boost.
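As a sketch, the weighted score and compounding bonuses described above can be written as follows (weights are from the default preset; the function name and signature are illustrative, not the actual API):

```python
# Illustrative sketch of the 3-criteria score with bonus multipliers.
# Weights match the default preset (0.40 / 0.30 / 0.30); all names are
# hypothetical, not the real CodeGraph API.
def priority_score(cwe_frequency, attack_similarity, codebase_exposure,
                   known_cve=False, critical=False, recently_exploited=False):
    base = (0.40 * cwe_frequency
            + 0.30 * attack_similarity
            + 0.30 * codebase_exposure)
    multiplier = 1.0
    if known_cve:
        multiplier *= 1.20   # known CVE pattern: +20%
    if critical:
        multiplier *= 1.10   # critical severity: +10%
    if recently_exploited:
        multiplier *= 1.15   # exploited in the wild: +15%
    return min(base * multiplier, 1.0)  # clamp to the scoring range
```

With all three bonuses active, the multiplier compounds to 1.2 × 1.1 × 1.15 ≈ 1.518, matching the "up to 1.52x" boost quoted above.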
Full scoring formula, component breakdowns, and API: Hypothesis System Reference — Multi-Criteria Scoring
2.2 Hypothesis-Driven vs. Pattern-Driven
| Aspect | Pattern-Driven SAST | Hypothesis-Driven (CodeGraph) |
|---|---|---|
| Input | Regex/AST patterns | Structured hypothesis: source → sink → CWE → attack |
| Evidence | Pattern match location | Taint path through CPG + codebase exposure analysis |
| Prioritization | Severity label only | 3-criteria score + bonus multipliers |
| False Positive Handling | Suppress rules (manual) | Feedback loop with automatic adjustment |
| Incrementality | Full rescan | Git diff-based delta analysis |
| Output | Flat finding list | Prioritized hypotheses with evidence chains |
The key insight: a vulnerability is not just a pattern match at a single point — it is a hypothesis about data flow that must be tested against the actual code graph.
2.3 Cartesian Product Strategy
The generation engine creates hypotheses from the Cartesian product:
Hypotheses = CWE Database (58 entries)
x CAPEC Patterns (27 attacks)
x Language Patterns (per-framework sinks/sources/sanitizers)
→ Filtered by codebase relevance
→ Scored and ranked
This ensures comprehensive coverage: every combination of weakness, attack pattern, and language-specific dangerous function is evaluated. The scoring phase then eliminates irrelevant combinations — typically reducing thousands of theoretical hypotheses to 20-50 high-priority candidates for execution.
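The generate-then-filter strategy can be sketched with itertools.product; the CWE/CAPEC/sink data and the relevance filter below are toy illustrations, not the shipped databases:

```python
from itertools import product

# Toy sketch of the generation strategy: enumerate every
# (CWE, CAPEC pattern, language sink) combination, then keep only the
# combinations relevant to the codebase. All data here is hypothetical.
cwes = ["CWE-120", "CWE-89"]
capecs = ["CAPEC-100", "CAPEC-66"]
sinks = ["strcpy", "SPI_execute"]

# Hypothetical relevance filter: keep a hypothesis only when its sink
# actually occurs among the symbols found in the scanned code.
codebase_symbols = {"strcpy", "recv"}

hypotheses = [
    (cwe, capec, sink)
    for cwe, capec, sink in product(cwes, capecs, sinks)
    if sink in codebase_symbols
]
# 2 x 2 x 2 = 8 raw combinations, filtered down to the 4 involving strcpy.
```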
3. Architecture Overview
3.1 Pipeline
+-------------------------------------------------------------------+
| HYPOTHESIS VALIDATION PIPELINE |
+-------------------------------------------------------------------+
[1] GENERATION ──> [2] SCORING ──> [3] QUERY SYNTHESIS
CWE x CAPEC 3-criteria SQL/PGQ
x Patterns + bonuses templates
|
v
[6] FEEDBACK <── [5] VALIDATION <── [4] EXECUTION
Analyst Confirm/ Run against
review Reject CPG database
Each stage is independently configurable and can be run in isolation. For example, scoring presets allow different weight profiles for embedded vs. web vs. enterprise applications.
For class names, constructor signatures, and method details, see Hypothesis System Reference — Pipeline Architecture
3.2 Domain Providers
The system ships with 6 specialized domain providers, each supplying framework-specific sinks, sources, and sanitizers:
| Provider | Framework | Language | Example Sinks |
|---|---|---|---|
| PostgreSQLPatternProvider | PostgreSQL | C | appendPQExpBuffer, SPI_execute, PQexec |
| DjangoPatternProvider | Django | Python | raw(), extra(), RawSQL() |
| SpringPatternProvider | Spring | Java | JdbcTemplate.query(), @RequestMapping |
| ExpressPatternProvider | Express | JavaScript | res.send(), eval(), child_process.exec() |
| GinPatternProvider | Gin | Go | c.String(), exec.Command(), db.Raw() |
| NextJSPatternProvider | Next.js | TypeScript | dangerouslySetInnerHTML, getServerSideProps |
Additionally, YAMLRuleProvider allows teams to define custom rules in YAML configuration files for internal frameworks and proprietary APIs.
All providers implement the PatternProvider base interface: get_sinks(), get_sources(), get_sanitizers().
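A custom provider for an internal framework might look like this minimal sketch; only the three method names come from the interface described above, while the base-class shape and all pattern data are assumptions:

```python
# Minimal sketch of the PatternProvider interface named above
# (get_sinks / get_sources / get_sanitizers). The base-class shape is
# assumed; in the real system it ships with the product.
class PatternProvider:
    def get_sinks(self):
        raise NotImplementedError

    def get_sources(self):
        raise NotImplementedError

    def get_sanitizers(self):
        raise NotImplementedError


class InternalFrameworkProvider(PatternProvider):
    """Hypothetical provider for an in-house framework."""

    def get_sinks(self):
        return ["legacy_db.run_query", "tmpl.render_unsafe"]

    def get_sources(self):
        return ["http.get_param", "queue.pop_message"]

    def get_sanitizers(self):
        return ["escape_sql", "strip_html"]
```

Teams that prefer configuration over code can express the same three lists via the YAMLRuleProvider mentioned above.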
3.3 Scoring Presets
Different application profiles warrant different scoring weights. Four built-in presets:
| Preset | CWE Frequency | Attack Similarity | Codebase Exposure | Best For |
|---|---|---|---|---|
| default | 0.40 | 0.30 | 0.30 | General-purpose analysis |
| embedded | 0.50 | 0.20 | 0.30 | IoT/embedded (CWE prevalence dominates) |
| web | 0.30 | 0.40 | 0.30 | Web apps (attack patterns more relevant) |
| enterprise | 0.35 | 0.35 | 0.30 | Enterprise (balanced CWE + attack) |
The embedded preset emphasizes CWE frequency because embedded systems have well-known vulnerability patterns (buffer overflows, integer overflows) where historical CWE data is highly predictive. The web preset emphasizes attack similarity because web application vulnerabilities are more diverse and attack pattern matching better discriminates real threats.
Presets are configured via config.yaml → hypothesis.scoring.presets or passed directly to the scorer constructor.
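The preset table reduces to a simple weight map; the dict layout and lookup helper below are illustrative, not the actual config.yaml schema or scorer constructor:

```python
# Illustrative weight map mirroring the preset table above. Key names
# are hypothetical; the real schema lives in config.yaml under
# hypothesis.scoring.presets.
PRESETS = {
    "default":    {"cwe_frequency": 0.40, "attack_similarity": 0.30, "codebase_exposure": 0.30},
    "embedded":   {"cwe_frequency": 0.50, "attack_similarity": 0.20, "codebase_exposure": 0.30},
    "web":        {"cwe_frequency": 0.30, "attack_similarity": 0.40, "codebase_exposure": 0.30},
    "enterprise": {"cwe_frequency": 0.35, "attack_similarity": 0.35, "codebase_exposure": 0.30},
}

def weights_for(profile: str) -> dict:
    # Fall back to the general-purpose profile for unknown names.
    return PRESETS.get(profile, PRESETS["default"])
```

Note that every preset keeps the three weights summing to 1.0, so scores remain comparable across profiles.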
4. V2 Capabilities
4.1 Evidence Qualification
EvidenceQualifier provides multi-factor confidence assessment for each piece of evidence:
| Factor | Description | Weight |
|---|---|---|
| has_taint_path | Data flow from source to sink verified via CPG | Highest |
| sanitizer_absent | No sanitizer function found on the path | High |
| cross_function | Taint crosses function boundaries | Medium |
| user_facing_source | Source is user-controlled input | Medium |
| exploitable_sink | Sink is known to be directly exploitable | Medium |
A hypothesis with has_taint_path=True, sanitizer_absent=True, and user_facing_source=True receives a confidence score near 1.0, making it a high-priority finding. This replaces the binary “found/not found” assessment of traditional tools.
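A minimal sketch of how these factors might roll up into one confidence value; the numeric weights are hypothetical (chosen only to respect the Highest/High/Medium ordering), and only the factor names come from the table above:

```python
# Hypothetical factor weights respecting the Highest > High > Medium
# ordering from the table; the real EvidenceQualifier's weights are
# not reproduced here.
FACTOR_WEIGHTS = {
    "has_taint_path":     0.40,  # highest
    "sanitizer_absent":   0.30,  # high
    "cross_function":     0.10,  # medium
    "user_facing_source": 0.10,  # medium
    "exploitable_sink":   0.10,  # medium
}

def confidence(factors: dict) -> float:
    """Sum the weights of all factors that are present/true."""
    return sum(w for name, w in FACTOR_WEIGHTS.items() if factors.get(name))
```

Under this toy weighting, the taint-path/no-sanitizer/user-facing combination already scores 0.8, and the remaining corroborating factors push it toward 1.0.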
4.2 Vulnerability Chain Detection
ChainDetector identifies privilege escalation chains — sequences where exploiting vulnerability A enables vulnerability B:
Example chain:
CWE-200 (Information Disclosure) → reveals memory layout
→ CWE-125 (Out-of-bounds Read) → leaks canary value
→ CWE-120 (Buffer Overflow) → achieves code execution
Detection: ChainDetector.detect(hypotheses) → List[VulnerabilityChain]
Chains are constructed from a built-in escalation pattern graph (ESCALATION_PATTERNS) that encodes known multi-step attack sequences. A chain of individually medium-severity findings may represent a critical composite vulnerability.
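Chain discovery can be sketched as a walk over the escalation graph. The ESCALATION_PATTERNS name appears above, but its contents here are just the example chain, and the traversal is a simplification of whatever the real detector does:

```python
# Toy escalation graph: "exploiting A enables B" edges. Only the example
# chain from the text is encoded; the shipped graph is larger.
ESCALATION_PATTERNS = {
    "CWE-200": ["CWE-125"],  # info disclosure -> out-of-bounds read
    "CWE-125": ["CWE-120"],  # out-of-bounds read -> buffer overflow
}

def detect_chains(findings: set) -> list:
    """Return maximal escalation chains among the confirmed findings."""
    chains = []

    def walk(node, path):
        nexts = [n for n in ESCALATION_PATTERNS.get(node, []) if n in findings]
        if not nexts:
            if len(path) > 1:          # single findings are not chains
                chains.append(path)
            return
        for n in nexts:
            walk(n, path + [n])

    for finding in findings:
        walk(finding, [finding])
    return chains
```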
4.3 Analyst Feedback Loop
FeedbackStore implements persistent analyst feedback that improves scoring accuracy over time:
Analyst marks finding as FP
→ FeedbackStore.record_outcome(hypothesis_id, "false_positive")
→ Next scan: get_adjustment() applies negative weight
→ After N confirmations: get_suppressed() auto-suppresses
Analyst confirms finding
→ FeedbackStore.record_outcome(hypothesis_id, "true_positive")
→ Next scan: similar hypotheses get priority boost
The feedback loop is stored in a local SQLite database (~/.codegraph/hypothesis_feedback.sqlite) and persists across analysis runs. This addresses the critical SAST pain point: false positives that keep reappearing after every scan.
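A minimal sketch of the store on SQLite: the table schema, adjustment formula, and suppression threshold are assumptions; only the method names and the suppress-after-N behaviour come from the description above.

```python
import sqlite3

# Illustrative FeedbackStore sketch. Schema and magnitudes are made up;
# the real store lives at ~/.codegraph/hypothesis_feedback.sqlite.
class FeedbackStore:
    def __init__(self, path=":memory:", suppress_after=3):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS feedback "
                        "(hypothesis_id TEXT, outcome TEXT)")
        self.suppress_after = suppress_after

    def record_outcome(self, hypothesis_id, outcome):
        self.db.execute("INSERT INTO feedback VALUES (?, ?)",
                        (hypothesis_id, outcome))

    def _count(self, hypothesis_id, outcome):
        row = self.db.execute(
            "SELECT COUNT(*) FROM feedback WHERE hypothesis_id=? AND outcome=?",
            (hypothesis_id, outcome)).fetchone()
        return row[0]

    def get_adjustment(self, hypothesis_id):
        # Positive weight per confirmation, negative per FP mark.
        return (0.1 * self._count(hypothesis_id, "true_positive")
                - 0.1 * self._count(hypothesis_id, "false_positive"))

    def is_suppressed(self, hypothesis_id):
        # Auto-suppress after N false-positive confirmations.
        return self._count(hypothesis_id, "false_positive") >= self.suppress_after
```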
4.4 Incremental Analysis
IncrementalAnalyzer performs git diff-based delta analysis — only generating and scoring hypotheses for code changed between two commits:
IncrementalAnalyzer(db_path, source_path)
.run_incremental(from_ref="v1.0", to_ref="HEAD")
→ Identifies changed files and functions
→ Generates hypotheses scoped to changed code
→ Runs full scoring + validation pipeline
→ Returns: only new/changed findings
This makes hypothesis analysis practical for CI/CD: a full scan on 50K LOC takes ~30 seconds, but an incremental scan on a typical PR (100-500 LOC changed) completes in 1-3 seconds.
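The delta-scoping step can be sketched as parsing git diff --name-only output. The extension filter is illustrative, and the real analyzer additionally maps changed files down to changed functions:

```python
# Illustrative scoping helper: keep only changed files worth
# re-analysing. The extension list is an assumption.
ANALYZABLE = {".c", ".h", ".py", ".java", ".js", ".go", ".ts"}

def changed_targets(diff_output: str) -> list:
    """Filter `git diff --name-only` output to analyzable source files."""
    files = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return [f for f in files
            if "." in f and "." + f.rsplit(".", 1)[-1] in ANALYZABLE]

# Typical usage (not run here):
#   diff = subprocess.run(["git", "diff", "--name-only", "v1.0..HEAD"],
#                         capture_output=True, text=True).stdout
#   targets = changed_targets(diff)
```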
4.5 Trend Monitoring
HypothesisTrendStore tracks vulnerability trends across releases:
record_run(project, commit, timestamp, results)
→ Stores confirmed/rejected counts per category
get_trend(project, days=90)
→ Returns time series: buffer_overflow, injection, etc.
get_delta(project, from_commit, to_commit)
→ Returns: new findings, resolved findings, regressions
Trend data enables security dashboards showing whether the codebase is getting more or less secure over time, which CWE categories are trending up, and whether specific remediation efforts are having impact.
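The delta computation can be sketched over sets of confirmed finding IDs. The data model is assumed; regression detection (a previously resolved finding reappearing) would additionally need the full run history and is omitted here:

```python
# Illustrative delta between two recorded runs, keyed by finding ID.
# The real HypothesisTrendStore persists per-category counts as well.
def get_delta(run_a: set, run_b: set) -> dict:
    """Compare confirmed finding IDs between two commits."""
    return {
        "new": sorted(run_b - run_a),
        "resolved": sorted(run_a - run_b),
        "persisting": sorted(run_a & run_b),
    }
```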
5. Deployment Patterns
5.1 CLI-Driven Analysis
Ad-hoc analysis by security engineers:
# Full hypothesis validation
python -m src.cli hypothesis run --language C --max 50 --min-priority 0.3
# List supported CWEs (optionally by category)
python -m src.cli hypothesis list-cwes --category buffer_overflow
# Show available domain providers
python -m src.cli hypothesis providers
# Output as JSON for automation
python -m src.cli hypothesis run --language C --format json > results.json
5.2 CI/CD Pipeline Integration
Incremental analysis on every PR:
# .github/workflows/security-hypothesis.yml
- name: Hypothesis Analysis
run: |
python -m src.cli hypothesis run \
--language C \
--max 100 \
--min-priority 0.3 \
--format json > hypothesis_results.json
- name: Check for Critical Findings
run: |
python -c "
import json, sys
data = json.load(open('hypothesis_results.json'))
critical = [h for h in data.get('hypotheses', [])
if h.get('priority_score', 0) > 0.8]
if critical:
print(f'BLOCKED: {len(critical)} critical findings')
sys.exit(1)
"
For incremental mode, the IncrementalAnalyzer scopes analysis to changed code only, reducing CI runtime to seconds.
5.3 REST API and MCP Integration
REST API:
POST /api/v1/security/hypotheses/run
Body: { "language": "C", "max_hypotheses": 50 }
Returns: { "hypotheses": [...], "metrics": {...} }
GET /api/v1/security/hypotheses/cwes?category=buffer_overflow
Returns: [{ "id": "CWE-120", "name": "Buffer Copy...", ... }]
GET /api/v1/security/hypotheses/providers
Returns: ["PostgreSQLPatternProvider", "DjangoPatternProvider", ...]
MCP (Model Context Protocol):
Tool: codegraph_hypothesis
Parameters: language, max_hypotheses, min_priority_score
Returns: structured hypothesis results for AI assistant consumption
6. Validation Results
6.1 Comparison with Traditional SAST
Benchmark on PostgreSQL 17 codebase (~1.3M LOC C):
| Tool | True Positives | False Positives | Precision | FP Rate |
|---|---|---|---|---|
| Pattern-only SAST | 3 | 45 | 6.25% | 93.75% |
| CodeGraph (hypothesis) | 3 | 2 | 60% | 40% |
| CodeGraph + TaintVerified | 3 | 0.4 | 88% | 12% |
Key observations:
- Pattern-only SAST produces over 20x more false positives than CodeGraph (45 vs. 2)
- Adding taint verification (TaintVerifiedScanner) further reduces FP rate from 40% to 12%
- All three target CVEs (CVE-2025-8713, CVE-2025-8714, CVE-2025-8715) were detected in all configurations — hypothesis scoring does not sacrifice recall
Full benchmark metrics and CVE details: Hypothesis System Reference — Performance
6.2 Taint Verification and Path Feasibility
Two additional verification layers reduce false positives beyond scoring:
Taint Path Visualization. Confirmed hypotheses include full data flow paths rendered as:
- Mermaid flowcharts — interactive source-to-sink diagrams
- SARIF 2.1.0 codeFlows — step-by-step taint propagation for IDE integration
z3 Path Feasibility. The z3 SMT solver validates path constraints, eliminating infeasible paths where the exploit depends on impossible input conditions. This is particularly effective for conditional vulnerabilities:
Example:
Hypothesis: recv() → strcpy() may overflow
z3 analysis: path requires buffer_size > 1024 AND user_controlled = true
Result: FEASIBLE (no constraint contradiction) → hypothesis CONFIRMED
Hypothesis: getenv("DEBUG") → system()
z3 analysis: path requires DEBUG == "1" in production environment
Result: INFEASIBLE (contradicts deployment config) → hypothesis REJECTED
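The underlying idea can be illustrated without an SMT solver: a path is infeasible when its accumulated branch constraints contradict one another. This toy version only intersects integer bounds for a single variable; the real system delegates general constraint solving to z3:

```python
# Toy feasibility check: intersect bounds implied by the branch
# conditions on one integer variable. A contradiction (empty interval)
# means the path is infeasible and the hypothesis can be rejected.
def feasible(constraints) -> bool:
    """constraints: list of (op, bound) pairs, op in {'>', '<', '=='}."""
    lo, hi = float("-inf"), float("inf")
    for op, bound in constraints:
        if op == ">":
            lo = max(lo, bound + 1)
        elif op == "<":
            hi = min(hi, bound - 1)
        elif op == "==":
            lo, hi = max(lo, bound), min(hi, bound)
    return lo <= hi

# A path requiring buffer_size > 1024 is satisfiable on its own, while a
# path requiring n > 100 and n < 50 is contradictory and gets pruned.
```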
OWASP Top 10 Mapping. All confirmed findings are automatically classified per OWASP Top 10 categories via owasp_mapping.py, enabling compliance reporting.
7. Conclusion and ROI
Business Value
CodeGraph’s hypothesis validation system delivers measurable impact:
| Metric | Before (SAST) | After (CodeGraph) | Improvement |
|---|---|---|---|
| FP Rate | 70-90% | <15% | 5-6x reduction |
| Analyst Time per Finding | ~30 min | ~5 min | 6x faster triage |
| Findings per Scan | 100+ (mostly FP) | 10-20 (mostly TP) | Actionable output |
| Rescan Time (PR) | Full scan | Incremental (1-3s) | 10-30x faster |
| FP Recurrence | Every scan | Auto-suppressed | Zero repeat FPs |
Key Differentiators
- Contextual analysis — considers the specific codebase, not just patterns
- Taint verification — confirms data flow through CPG, not just pattern proximity
- Risk prioritization — 3-criteria scoring with domain-specific presets
- Feedback loop — analyst decisions permanently improve future scans
- Incremental mode — seconds-level CI/CD integration
- Chain detection — discovers multi-step attack paths invisible to single-finding tools
- 6 domain providers — framework-aware analysis out of the box
Related Documents
- Hypothesis System Reference — API reference, data models, code examples
- Enterprise Security Brief — Executive summary
- Competitive Matrix — Vendor comparison
Version: 2.0 | March 2026