Security Hypothesis Validation¶
Technical Document for Security Architects and Researchers
Table of Contents¶
- Abstract
- 1. The Problem
- 1.1 Limitations of Traditional SAST
- 1.2 Why Pattern Matching Is Not Enough
- 2. Solution Architecture
- 2.1 Hypothesis Validation Pipeline
- 3. Multi-Criteria Scoring Model
- 3.1 Prioritization Formula
- 3.2 Scoring Components
- 3.3 Bonus Multipliers
- 4. Codebase Statistics
- 4.1 Statistics Collection from CPG
- 4.2 Tracked Functions
- 5. Taint Analysis on CPG
- 5.1 Data Flow Verification
- 5.2 Sanitization Check
- 6. Validation Results
- 6.1 Benchmark on PostgreSQL 17
- 6.2 Detected CVEs
- 6.3 Comparison with Traditional SAST
- 7. Hypothesis Structure
- 7.1 SecurityHypothesis
- 7.2 Hypothesis Format
- 8. Integration API
- 8.1 Full Example
- 9. Conclusion
- Related Documents
Abstract¶
Traditional SAST (Static Application Security Testing) tools suffer from high false positive rates (up to 70-90%), making analysis results practically unusable for real work. CodeGraph solves this problem with a multi-criteria hypothesis validation system that:
- Generates testable hypotheses based on CWE/CAPEC knowledge bases
- Evaluates hypotheses across three criteria considering codebase context
- Verifies vulnerabilities through taint analysis on Code Property Graph
- Achieves 100% CVE detection rate while reducing the false positive rate from 70-90% to under 30% (60%+ reduction)
1. The Problem¶
1.1 Limitations of Traditional SAST¶
Traditional SAST:
Pattern: "strcpy" found
Result: POSSIBLE vulnerability
False Positive Rate: 70-90%
CodeGraph:
Hypothesis: Untrusted data flows from recv() to strcpy()
Evidence: Taint path verified via CPG
Result: CONFIRMED vulnerability
False Positive Rate: <30%
1.2 Why Pattern Matching Is Not Enough¶
| Problem | Description |
|---|---|
| No context | strcpy is safe if the source is a constant |
| No data flow | Doesn’t consider where data comes from |
| No sanitization | Ignores validator functions |
| No prioritization | All findings have equal weight |
2. Solution Architecture¶
2.1 Hypothesis Validation Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ HYPOTHESIS VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. GENERATION │ │
│ │ HypothesisGenerator.generate() │ │
│ │ ├── CWE Database (58 entries) │ │
│ │ ├── CAPEC Database (27 attack patterns) │ │
│ │ ├── Language Patterns (C, Python, Java) │ │
│ │ └── Cartesian Product: CWEs × CAPECs × Patterns │ │
│ │ Output: SecurityHypothesis[] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 2. MULTI-CRITERIA SCORING │ │
│ │ MultiCriteriaScorer.score_batch() │ │
│ │ │ │
│ │ Score = CWE_Freq × 0.40 + Attack_Sim × 0.30 + Exposure × 0.30│ │
│ │ │ │
│ │ Bonuses: │ │
│ │ ├── Known CVE pattern: ×1.20 │ │
│ │ ├── Critical severity: ×1.10 │ │
│ │ └── Recent exploit: ×1.15 │ │
│ │ │ │
│ │ Output: Prioritized hypotheses │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 3. QUERY SYNTHESIS │ │
│ │ QuerySynthesizer.synthesize() │ │
│ │ ├── Match hypothesis to SQL template │ │
│ │ ├── Parameter substitution │ │
│ │ └── Output: DuckDB SQL / PGQ queries │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 4. EXECUTION │ │
│ │ HypothesisExecutor.execute() │ │
│ │ ├── Run queries against CPG │ │
│ │ ├── Collect evidence │ │
│ │ └── Output: Evidence[] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 5. VALIDATION │ │
│ │ HypothesisValidator.validate() │ │
│ │ ├── Analyze evidence │ │
│ │ ├── Update hypothesis status (CONFIRMED/REJECTED) │ │
│ │ └── Calculate precision/recall metrics │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
3. Multi-Criteria Scoring Model¶
3.1 Prioritization Formula¶
Priority Score = (CWE_Frequency × 0.40)
+ (Attack_Similarity × 0.30)
+ (Codebase_Exposure × 0.30)
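The formula above can be sketched as a plain function. The weights come from the text; the component scores are assumed to be pre-normalized to the 0.0-1.0 range described in Section 3.2:

```python
def priority_score(cwe_frequency: float,
                   attack_similarity: float,
                   codebase_exposure: float) -> float:
    """Weighted sum from Section 3.1; each input is expected in [0.0, 1.0]."""
    return (cwe_frequency * 0.40
            + attack_similarity * 0.30
            + codebase_exposure * 0.30)

# Example: a common, easily attacked, moderately exposed hypothesis
score = priority_score(0.86, 0.80, 0.50)  # -> 0.734
```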
3.2 Scoring Components¶
CWE Frequency Score (40%)¶
Evaluates how often this vulnerability appears in real CVEs.
def _score_cwe_frequency(cwe_ids: List[str]) -> float:
"""
Components:
- prevalence: Frequency in CVE database (0.0-1.0)
- exploitability: How easy to exploit (0.0-1.0)
- cvss_base: CVSS base score / 10 (0.0-1.0)
Score = prevalence × 0.4 + exploitability × 0.4 + cvss × 0.2
"""
| CWE | Prevalence | Exploitability | CVSS | Score |
|---|---|---|---|---|
| CWE-120 (Buffer Overflow) | 0.85 | 0.90 | 8.0 | 0.86 |
| CWE-78 (Command Injection) | 0.75 | 0.95 | 9.8 | 0.88 |
| CWE-89 (SQL Injection) | 0.90 | 0.95 | 9.8 | 0.94 |
| CWE-200 (Info Disclosure) | 0.60 | 0.70 | 5.3 | 0.63 |
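A runnable sketch of the frequency score follows. The per-CWE numbers are taken from the table above; the lookup-dict shape and the averaging over multiple CWEs are assumptions, since the document does not specify how several CWE IDs are aggregated:

```python
# Illustrative per-CWE metadata (values from the table above; the real
# implementation reads these from the CWE database of 58 entries).
CWE_STATS = {
    "CWE-120": {"prevalence": 0.85, "exploitability": 0.90, "cvss": 8.0},
    "CWE-89":  {"prevalence": 0.90, "exploitability": 0.95, "cvss": 9.8},
}

def score_cwe_frequency(cwe_ids):
    """Weighted score per CWE, averaged over the hypothesis's CWE list
    (averaging is an assumption)."""
    def one(cwe):
        s = CWE_STATS[cwe]
        return (s["prevalence"] * 0.4
                + s["exploitability"] * 0.4
                + (s["cvss"] / 10) * 0.2)
    return sum(one(c) for c in cwe_ids) / len(cwe_ids)

score_cwe_frequency(["CWE-120"])  # -> 0.86, matching the table
```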
Attack Similarity Score (30%)¶
Evaluates how well the hypothesis matches known attack patterns from CAPEC.
def _score_attack_similarity(capec_ids: List[str]) -> float:
"""
Components:
- likelihood: Attack probability (0.0-1.0)
- skill_level: Required skill level
- Low: ×1.0 (higher risk)
- Medium: ×0.8
- High: ×0.6
- Expert: ×0.4 (lower risk)
Score = likelihood × skill_adjustment
"""
Codebase Exposure Score (30%)¶
Evaluates how exposed the specific codebase is to this vulnerability.
def _score_codebase_exposure(hypothesis) -> float:
"""
Components:
- sink_exposure: Presence of dangerous sink functions
- source_exposure: Presence of external data sources
- sanitizer_coverage: Presence of sanitizer functions (lowers risk)
- taint_paths: Number of source → sink paths
Exposure = (sink × 0.4 + source × 0.4) × (1 - sanitizer × 0.5)
"""
3.3 Bonus Multipliers¶
| Bonus | Multiplier | Condition |
|---|---|---|
| Known CVE | ×1.20 | Pattern matches known CVE |
| Critical Severity | ×1.10 | CWE has critical severity |
| Recent Exploit | ×1.15 | Recent exploitation in wild |
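Applying the bonuses can be sketched as below. The assumption here is that the multipliers stack multiplicatively, which would explain the "0.0-1.0+" range noted for priority_score in Section 7.1:

```python
def apply_bonuses(score: float, known_cve: bool = False,
                  critical: bool = False,
                  recent_exploit: bool = False) -> float:
    """Multiplicative bonuses from the table above; stacking them can
    push the final score above 1.0."""
    if known_cve:
        score *= 1.20
    if critical:
        score *= 1.10
    if recent_exploit:
        score *= 1.15
    return score

apply_bonuses(0.70, known_cve=True, recent_exploit=True)  # -> 0.966
```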
4. Codebase Statistics¶
4.1 Statistics Collection from CPG¶
from dataclasses import dataclass
from typing import Dict

@dataclass
class CodebaseStats:
    total_methods: int                 # Total method count
    total_calls: int                   # Total call count
    sink_counts: Dict[str, int]        # sink_name → count
    source_counts: Dict[str, int]      # source_name → count
    sanitizer_counts: Dict[str, int]   # sanitizer_name → count
    taint_paths: int                   # Number of source→sink paths
4.2 Tracked Functions¶
Dangerous Sinks (C):
strcpy, strcat, sprintf, gets, memcpy
system, popen, execl, execv
printf, fprintf (format string)
appendPQExpBuffer, SPI_execute, PQexec (PostgreSQL)
Untrusted Sources:
recv, read, fgets, getenv
PQgetvalue, SPI_getvalue, getTables (PostgreSQL)
Sanitizers:
strlcpy, snprintf
fmtId, quote_identifier, quote_literal
pg_class_aclcheck
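The lists above lend themselves to a simple lookup structure. The names and grouping below come from this section; the dict layout and the classify helper are illustrative assumptions, not the project's actual API:

```python
# Tracked C functions from Section 4.2, grouped by role (structure assumed).
TRACKED_C_FUNCTIONS = {
    "sinks": {"strcpy", "strcat", "sprintf", "gets", "memcpy",
              "system", "popen", "execl", "execv",
              "printf", "fprintf",
              "appendPQExpBuffer", "SPI_execute", "PQexec"},
    "sources": {"recv", "read", "fgets", "getenv",
                "PQgetvalue", "SPI_getvalue", "getTables"},
    "sanitizers": {"strlcpy", "snprintf",
                   "fmtId", "quote_identifier", "quote_literal",
                   "pg_class_aclcheck"},
}

def classify(name: str) -> str:
    """Return which tracked category a call name falls into, if any."""
    for role, names in TRACKED_C_FUNCTIONS.items():
        if name in names:
            return role
    return "untracked"
```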
5. Taint Analysis on CPG¶
5.1 Data Flow Verification¶
-- Find unvalidated paths from source to sink
FROM GRAPH_TABLE(cpg
MATCH (src:CALL)-[:REACHING_DEF*1..10]->(sink:CALL)
WHERE src.name IN ('recv', 'getenv', 'PQgetvalue')
AND sink.name IN ('strcpy', 'sprintf', 'system')
COLUMNS (
src.name AS source,
sink.name AS sink,
sink.filename,
sink.line_number
)
)
5.2 Sanitization Check¶
-- Check for sanitizers on the path
SELECT h.id, h.source, h.sink,
EXISTS (
SELECT 1 FROM nodes_call nc
WHERE nc.name IN ('strlcpy', 'snprintf', 'quote_identifier')
AND nc.line_number BETWEEN h.source_line AND h.sink_line
AND nc.filename = h.filename
) AS has_sanitizer
FROM hypothesis_paths h;
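The check above is a line-range heuristic: a sanitizer anywhere between the source and sink lines of the same file counts as coverage. A self-contained demo makes this concrete; it uses the stdlib sqlite3 module with toy data so it runs standalone, whereas the real query targets the DuckDB-backed CPG:

```python
import sqlite3

# Toy CPG tables mirroring the column names used in the SQL above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_call (name TEXT, filename TEXT, line_number INT);
CREATE TABLE hypothesis_paths (
    id INT, source TEXT, sink TEXT,
    filename TEXT, source_line INT, sink_line INT);
INSERT INTO nodes_call VALUES ('strlcpy', 'dump.c', 120);
INSERT INTO hypothesis_paths VALUES
    (1, 'PQgetvalue', 'strcpy', 'dump.c', 100, 150),  -- sanitizer in range
    (2, 'PQgetvalue', 'strcpy', 'dump.c', 200, 250);  -- no sanitizer
""")

rows = conn.execute("""
SELECT h.id,
       EXISTS (
         SELECT 1 FROM nodes_call nc
         WHERE nc.name IN ('strlcpy', 'snprintf', 'quote_identifier')
           AND nc.line_number BETWEEN h.source_line AND h.sink_line
           AND nc.filename = h.filename
       ) AS has_sanitizer
FROM hypothesis_paths h
ORDER BY h.id
""").fetchall()
# rows == [(1, 1), (2, 0)]: only the first path is sanitized
```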
6. Validation Results¶
6.1 Benchmark on PostgreSQL 17¶
| Metric | Value |
|---|---|
| CVE Detection Rate | 100% (3/3) |
| Hypothesis Confirmation Rate | 55% |
| Average Query Time | 2-3 ms |
| Generation Time (100 hyp.) | <1 sec |
| Execution Time (20 hyp.) | <30 sec |
6.2 Detected CVEs¶
| CVE ID | Type | Detection Method |
|---|---|---|
| CVE-2025-8713 | Statistics Disclosure | Hypothesis + Taint |
| CVE-2025-8714 | pg_dump Injection | Method-based |
| CVE-2025-8715 | Newline Injection | Method-based |
6.3 Comparison with Traditional SAST¶
| Tool | True Positives | False Positives | Precision |
|---|---|---|---|
| Pattern SAST | 3 | 45 | 6.25% |
| CodeGraph | 3 | 2 | 60% |
| TaintVerifiedScanner | 3 | 0.4 | 88% (12% FP rate) |
7. Hypothesis Structure¶
7.1 SecurityHypothesis¶
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SecurityHypothesis:
    id: str                            # Unique identifier
    hypothesis_text: str               # Hypothesis text
    # Classification
    cwe_ids: List[str]                 # ["CWE-120", "CWE-119"]
    capec_ids: List[str]               # ["CAPEC-100"]
    language: str                      # "C", "Python"
    category: str                      # "buffer_overflow"
    # Taint patterns
    source_patterns: List[str]         # ["PQgetvalue", "getenv"]
    sink_patterns: List[str]           # ["strcpy", "memcpy"]
    sanitizer_patterns: List[str]      # ["strlcpy", "sizeof"]
    # Scoring
    priority_score: float              # 0.0-1.0+
    confidence: float                  # 0.0-1.0
    # Multi-criteria breakdown
    cwe_frequency_score: float
    attack_similarity_score: float
    codebase_exposure_score: float
    # Validation (Evidence and ValidationStatus are project-defined types)
    sql_query: Optional[str]
    evidence: List[Evidence]
    validation_status: ValidationStatus
7.2 Hypothesis Format¶
"If untrusted data from {sources} flows to {sinks}
without sanitization via {sanitizers},
then {cwe_id} enables {capec_id} attack,
potentially allowing {impact}."
Example:
"If untrusted data from PQgetvalue() flows to strcpy()
without bounds checking via strlcpy(),
then CWE-120 enables CAPEC-100 (Buffer Overflow) attack,
potentially allowing memory corruption or code execution."
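Filling the template is a plain string substitution. The field names below match the placeholders in Section 7.2; the helper itself is illustrative rather than the project's actual generation code:

```python
# Hypothesis template from Section 7.2, instantiated with str.format.
TEMPLATE = ("If untrusted data from {sources} flows to {sinks} "
            "without sanitization via {sanitizers}, "
            "then {cwe_id} enables {capec_id} attack, "
            "potentially allowing {impact}.")

text = TEMPLATE.format(
    sources="PQgetvalue()", sinks="strcpy()", sanitizers="strlcpy()",
    cwe_id="CWE-120", capec_id="CAPEC-100 (Buffer Overflow)",
    impact="memory corruption or code execution",
)
```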
8. Integration API¶
8.1 Full Example¶
from src.security.hypothesis import (
HypothesisGenerator,
MultiCriteriaScorer,
QuerySynthesizer,
HypothesisExecutor,
HypothesisValidator,
CodebaseStats,
compute_codebase_stats_from_duckdb
)
import duckdb
from src.project_manager import ProjectManager
# 1. Connect to CPG
conn = duckdb.connect(ProjectManager.get_active_db_path())
# 2. Gather codebase statistics
stats = compute_codebase_stats_from_duckdb(ProjectManager.get_active_db_path())
# 3. Generate hypotheses
generator = HypothesisGenerator()
hypotheses = generator.generate(
language="C",
cwe_filter=["CWE-120", "CWE-78", "CWE-89"],
max_hypotheses=100
)
# 4. Multi-criteria scoring
scorer = MultiCriteriaScorer(codebase_stats=stats)
scored = scorer.score_batch(hypotheses)
# 5. Synthesize SQL queries
synthesizer = QuerySynthesizer()
for h in scored:
h.sql_query = synthesizer.synthesize_query(h)
# 6. Execute on CPG
executor = HypothesisExecutor(conn)
for h in scored[:20]: # Top 20
evidence = executor.execute(h)
h.evidence.extend(evidence)
# 7. Validate and report
validator = HypothesisValidator()
results = validator.validate_batch(scored)
print(f"Detection Rate: {results.detection_rate:.1%}")
print(f"Precision: {results.precision:.1%}")
print(f"F1 Score: {results.f1_score:.2f}")
9. Conclusion¶
CodeGraph’s multi-criteria hypothesis validation system represents a fundamentally new approach to vulnerability detection:
- Contextual analysis — considers specifics of the particular codebase
- Taint verification — confirms data flow through CPG
- Risk prioritization — focuses effort on actually exploitable vulnerabilities
- False positive reduction — from 70-90% down to less than 30%
Result: 100% detection rate for target CVEs with dramatic reduction in false positives.
Taint Visualization and SARIF Integration¶
Confirmed hypotheses can be exported with full taint path visualization:
- Mermaid flowcharts (src/security/taint_visualizer.py) render source-to-sink data flows as interactive diagrams
- SARIF 2.1.0 export (src/security/sarif_exporter.py) includes codeFlows with step-by-step taint propagation
- OWASP Top 10 mapping (src/security/owasp_mapping.py) classifies findings for compliance reporting
Path Feasibility with z3¶
The z3 symbolic execution engine validates path constraints for confirmed hypotheses, eliminating infeasible paths and further reducing false positives. This is particularly effective for conditional vulnerabilities where the exploit depends on specific input ranges or configuration.
Related Documents¶
Version: 1.1 | February 2026