Security Hypothesis Validation¶
Technical Document for Security Architects and Researchers
Table of Contents¶
- Abstract
- 1. The Problem
- 1.1 Limitations of Traditional SAST
- 1.2 Why Pattern Matching Is Not Enough
- 2. Solution Architecture
- 2.1 Hypothesis Validation Pipeline
- 3. Multi-Criteria Scoring Model
- 3.1 Prioritization Formula
- 3.2 Scoring Components
- 3.3 Bonus Multipliers
- 4. Codebase Statistics
- 4.1 Statistics Collection from CPG
- 4.2 Tracked Functions
- 5. Taint Analysis on CPG
- 5.1 Data Flow Verification
- 5.2 Sanitization Check
- 6. Validation Results
- 6.1 Benchmark on PostgreSQL 17
- 6.2 Detected CVEs
- 6.3 Comparison with Traditional SAST
- 7. Hypothesis Structure
- 7.1 SecurityHypothesis
- 7.2 Hypothesis Format
- 8. Integration API
- 8.1 Full Example
- 9. Conclusion
- Related Documents
Abstract¶
Traditional SAST (Static Application Security Testing) tools suffer from high false positive rates (up to 70-90%), which makes their results impractical for day-to-day triage. CodeGraph addresses this problem with a multi-criteria hypothesis validation system that:
- Generates testable hypotheses based on CWE/CAPEC knowledge bases
- Evaluates hypotheses across three criteria considering codebase context
- Verifies vulnerabilities through taint analysis on Code Property Graph
- Achieves a 100% detection rate on target CVEs while cutting the false positive rate from 70-90% to under 30%
1. The Problem¶
1.1 Limitations of Traditional SAST¶
Traditional SAST:
Pattern: "strcpy" found
Result: POSSIBLE vulnerability
False Positive Rate: 70-90%
CodeGraph:
Hypothesis: Untrusted data flows from recv() to strcpy()
Evidence: Taint path verified via CPG
Result: CONFIRMED vulnerability
False Positive Rate: <30%
1.2 Why Pattern Matching Is Not Enough¶
| Problem | Description |
|---|---|
| No context | strcpy can be safe when the source is a fixed string that fits the destination buffer |
| No data flow | Does not track where the data comes from |
| No sanitization | Ignores sanitizer and validator functions |
| No prioritization | All findings carry equal weight |
2. Solution Architecture¶
2.1 Hypothesis Validation Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ HYPOTHESIS VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. GENERATION │ │
│ │ HypothesisGenerator.generate() │ │
│ │ ├── CWE Database (120+ patterns) │ │
│ │ ├── CAPEC Database (50+ attack patterns) │ │
│ │ ├── Language Patterns (C, Python, Java) │ │
│ │ └── Cartesian Product: CWEs × CAPECs × Patterns │ │
│ │ Output: SecurityHypothesis[] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 2. MULTI-CRITERIA SCORING │ │
│ │ MultiCriteriaScorer.score_batch() │ │
│ │ │ │
│ │ Score = CWE_Freq × 0.40 + Attack_Sim × 0.30 + Exposure × 0.30│ │
│ │ │ │
│ │ Bonuses: │ │
│ │ ├── Known CVE pattern: ×1.20 │ │
│ │ ├── Critical severity: ×1.10 │ │
│ │ └── Recent exploit: ×1.15 │ │
│ │ │ │
│ │ Output: Prioritized hypotheses │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 3. QUERY SYNTHESIS │ │
│ │ QuerySynthesizer.synthesize() │ │
│ │ ├── Match hypothesis to SQL template │ │
│ │ ├── Parameter substitution │ │
│ │ └── Output: DuckDB SQL / PGQ queries │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 4. EXECUTION │ │
│ │ HypothesisExecutor.execute() │ │
│ │ ├── Run queries against CPG │ │
│ │ ├── Collect evidence │ │
│ │ └── Output: Evidence[] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 5. VALIDATION │ │
│ │ HypothesisValidator.validate() │ │
│ │ ├── Analyze evidence │ │
│ │ ├── Update hypothesis status (CONFIRMED/REJECTED) │ │
│ │ └── Calculate precision/recall metrics │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
3. Multi-Criteria Scoring Model¶
3.1 Prioritization Formula¶
Priority Score = (CWE_Frequency × 0.40)
+ (Attack_Similarity × 0.30)
+ (Codebase_Exposure × 0.30)
3.2 Scoring Components¶
CWE Frequency Score (40%)¶
Evaluates how often this vulnerability appears in real CVEs.
def _score_cwe_frequency(cwe_ids: List[str]) -> float:
"""
Components:
- prevalence: Frequency in CVE database (0.0-1.0)
- exploitability: How easy to exploit (0.0-1.0)
- cvss_base: CVSS base score / 10 (0.0-1.0)
Score = prevalence × 0.4 + exploitability × 0.4 + cvss × 0.2
"""
| CWE | Prevalence | Exploitability | CVSS | Score |
|---|---|---|---|---|
| CWE-120 (Buffer Overflow) | 0.85 | 0.90 | 8.0 | 0.86 |
| CWE-78 (Command Injection) | 0.75 | 0.95 | 9.8 | 0.88 |
| CWE-89 (SQL Injection) | 0.90 | 0.95 | 9.8 | 0.94 |
| CWE-200 (Info Disclosure) | 0.60 | 0.70 | 5.3 | 0.63 |
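The table values follow directly from the weights above. A minimal sketch, assuming per-CWE metadata keyed by CWE id and averaging over multiple CWEs (the averaging rule is an assumption; the data values come from the table):
from typing import Dict, List
# Illustrative metadata taken from the table above
CWE_DB: Dict[str, Dict[str, float]] = {
    "CWE-120": {"prevalence": 0.85, "exploitability": 0.90, "cvss_base": 8.0},
    "CWE-89": {"prevalence": 0.90, "exploitability": 0.95, "cvss_base": 9.8},
}
def score_cwe_frequency(cwe_ids: List[str]) -> float:
    """prevalence × 0.4 + exploitability × 0.4 + (cvss_base / 10) × 0.2,
    averaged over the CWEs the hypothesis references (assumption)."""
    scores = [
        m["prevalence"] * 0.4 + m["exploitability"] * 0.4 + m["cvss_base"] / 10 * 0.2
        for cwe in cwe_ids
        if (m := CWE_DB.get(cwe)) is not None
    ]
    return sum(scores) / len(scores) if scores else 0.0
assert abs(score_cwe_frequency(["CWE-120"]) - 0.86) < 0.005  # matches the table row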
Attack Similarity Score (30%)¶
Evaluates how well the hypothesis matches known attack patterns from CAPEC.
def _score_attack_similarity(capec_ids: List[str]) -> float:
"""
Components:
- likelihood: Attack probability (0.0-1.0)
- skill_level: Required skill level
- Low: ×1.0 (higher risk)
- Medium: ×0.8
- High: ×0.6
- Expert: ×0.4 (lower risk)
Score = likelihood × skill_adjustment
"""
Codebase Exposure Score (30%)¶
Evaluates how exposed the specific codebase is to this vulnerability.
def _score_codebase_exposure(hypothesis) -> float:
"""
Components:
- sink_exposure: Presence of dangerous sink functions
- source_exposure: Presence of external data sources
- sanitizer_coverage: Presence of sanitizer functions (lowers risk)
- taint_paths: Number of source → sink paths
Exposure = (sink × 0.4 + source × 0.4) × (1 - sanitizer × 0.5)
"""
3.3 Bonus Multipliers¶
| Bonus | Multiplier | Condition |
|---|---|---|
| Known CVE | ×1.20 | Pattern matches known CVE |
| Critical Severity | ×1.10 | CWE has critical severity |
| Recent Exploit | ×1.15 | Recent exploitation in wild |
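Putting the Section 3.1 weights and these multipliers together (function and parameter names are illustrative, not the library API):
def priority_score(cwe_freq: float, attack_sim: float, exposure: float,
                   known_cve: bool = False, critical: bool = False,
                   recent_exploit: bool = False) -> float:
    """Weighted sum from Section 3.1, then the bonus multipliers above."""
    score = cwe_freq * 0.40 + attack_sim * 0.30 + exposure * 0.30
    if known_cve:
        score *= 1.20
    if critical:
        score *= 1.10
    if recent_exploit:
        score *= 1.15
    return score
# A CWE-120 hypothesis matching a known CVE pattern:
print(priority_score(0.86, 0.90, 0.80, known_cve=True))  # ≈1.02: bonuses can push past 1.0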
4. Codebase Statistics¶
4.1 Statistics Collection from CPG¶
from dataclasses import dataclass
from typing import Dict

@dataclass
class CodebaseStats:
    total_methods: int                 # Total method count
    total_calls: int                   # Total call count
    sink_counts: Dict[str, int]        # sink_name → count
    source_counts: Dict[str, int]      # source_name → count
    sanitizer_counts: Dict[str, int]   # sanitizer_name → count
    taint_paths: int                   # Number of source→sink paths
4.2 Tracked Functions¶
Dangerous Sinks (C):
strcpy, strcat, sprintf, gets, memcpy
system, popen, execl, execv
printf, fprintf (format string)
appendPQExpBuffer, SPI_execute, PQexec (PostgreSQL)
Untrusted Sources:
recv, read, fgets, getenv
PQgetvalue, SPI_getvalue, getTables (PostgreSQL)
Sanitizers:
strlcpy, snprintf
fmtId, quote_identifier, quote_literal
pg_class_aclcheck
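These counts feed CodebaseStats. A sketch of the collection step, assuming the nodes_call table and columns used by the queries in Section 5 (the exact CPG schema is an assumption, and this is not the compute_codebase_stats_from_duckdb implementation):
import duckdb
SINKS = ["strcpy", "strcat", "sprintf", "gets", "memcpy", "system", "popen"]
SOURCES = ["recv", "read", "fgets", "getenv", "PQgetvalue"]
SANITIZERS = ["strlcpy", "snprintf", "fmtId", "quote_identifier"]
def count_calls(conn: duckdb.DuckDBPyConnection, names: list) -> dict:
    """Count call-site occurrences per tracked function name."""
    rows = conn.execute(
        "SELECT name, COUNT(*) FROM nodes_call "
        "WHERE list_contains(?, name) GROUP BY name",
        [names],
    ).fetchall()
    return dict(rows)
conn = duckdb.connect("cpg.duckdb")
sink_counts = count_calls(conn, SINKS)
source_counts = count_calls(conn, SOURCES)
sanitizer_counts = count_calls(conn, SANITIZERS)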
5. Taint Analysis on CPG¶
5.1 Data Flow Verification¶
-- Find unvalidated paths from source to sink
FROM GRAPH_TABLE(cpg
MATCH (src:CALL)-[:REACHING_DEF*1..10]->(sink:CALL)
WHERE src.name IN ('recv', 'getenv', 'PQgetvalue')
AND sink.name IN ('strcpy', 'sprintf', 'system')
COLUMNS (
src.name AS source,
sink.name AS sink,
sink.filename,
sink.line_number
)
)
5.2 Sanitization Check¶
-- Check for sanitizers on the path
SELECT h.id, h.source, h.sink,
EXISTS (
SELECT 1 FROM nodes_call nc
WHERE nc.name IN ('strlcpy', 'snprintf', 'quote_identifier')
AND nc.line_number BETWEEN h.source_line AND h.sink_line
AND nc.filename = h.filename
) AS has_sanitizer
FROM hypothesis_paths h;
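Driving the check from Python is straightforward. The sketch below assumes hypothesis_paths was populated by the taint query in 5.1 and keeps only the paths with no sanitizer between source and sink:
import duckdb
conn = duckdb.connect("cpg.duckdb")
unsanitized = conn.execute("""
    SELECT h.id, h.source, h.sink, h.filename, h.sink_line
    FROM hypothesis_paths h
    WHERE NOT EXISTS (
        SELECT 1 FROM nodes_call nc
        WHERE nc.name IN ('strlcpy', 'snprintf', 'quote_identifier')
          AND nc.line_number BETWEEN h.source_line AND h.sink_line
          AND nc.filename = h.filename
    )
""").fetchall()
for hyp_id, source, sink, filename, line in unsanitized:
    print(f"Candidate {hyp_id}: {source} -> {sink} ({filename}:{line})")
Note that the BETWEEN clause is a line-range heuristic within a single file, not a path-sensitive guarantee that the sanitizer actually dominates the sink.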
6. Validation Results¶
6.1 Benchmark on PostgreSQL 17¶
| Metric | Value |
|---|---|
| CVE Detection Rate | 100% (3/3) |
| Hypothesis Confirmation Rate | 55% |
| Average Query Time | 2-3 ms |
| Generation Time (100 hypotheses) | <1 sec |
| Execution Time (20 hypotheses) | <30 sec |
6.2 Detected CVEs¶
| CVE ID | Type | Detection Method |
|---|---|---|
| CVE-2025-8713 | Statistics Disclosure | Hypothesis + Taint |
| CVE-2025-8714 | pg_dump Injection | Method-based |
| CVE-2025-8715 | Newline Injection | Method-based |
6.3 Comparison with Traditional SAST¶
| Tool | True Positives | False Positives | Precision |
|---|---|---|---|
| Pattern SAST | 3 | 45 | 6.25% |
| CodeGraph | 3 | 2 | 60% |
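Precision here is TP / (TP + FP): 3 / (3 + 45) = 6.25% for the pattern-based tool versus 3 / (3 + 2) = 60% for CodeGraph on the same three target CVEs.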
7. Hypothesis Structure¶
7.1 SecurityHypothesis¶
from dataclasses import dataclass
from typing import List, Optional

# Evidence and ValidationStatus are defined elsewhere in the package.
@dataclass
class SecurityHypothesis:
id: str # Unique identifier
hypothesis_text: str # Hypothesis text
# Classification
cwe_ids: List[str] # ["CWE-120", "CWE-119"]
capec_ids: List[str] # ["CAPEC-100"]
language: str # "C", "Python"
category: str # "buffer_overflow"
# Taint patterns
source_patterns: List[str] # ["PQgetvalue", "getenv"]
sink_patterns: List[str] # ["strcpy", "memcpy"]
sanitizer_patterns: List[str] # ["strlcpy", "sizeof"]
# Scoring
priority_score: float # 0.0-1.0+
confidence: float # 0.0-1.0
# Multi-criteria breakdown
cwe_frequency_score: float
attack_similarity_score: float
codebase_exposure_score: float
# Validation
sql_query: Optional[str]
evidence: List[Evidence]
validation_status: ValidationStatus
7.2 Hypothesis Format¶
"If untrusted data from {sources} flows to {sinks}
without sanitization via {sanitizers},
then {cwe_id} enables {capec_id} attack,
potentially allowing {impact}."
Example:
"If untrusted data from PQgetvalue() flows to strcpy()
without bounds checking via strlcpy(),
then CWE-120 enables CAPEC-100 (Buffer Overflow) attack,
potentially allowing memory corruption or code execution."
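Rendering is plain string substitution over the SecurityHypothesis fields; a minimal sketch (the rendering helper itself is an assumption):
TEMPLATE = (
    "If untrusted data from {sources} flows to {sinks} "
    "without sanitization via {sanitizers}, "
    "then {cwe_id} enables {capec_id} attack, "
    "potentially allowing {impact}."
)
print(TEMPLATE.format(
    sources="PQgetvalue()",
    sinks="strcpy()",
    sanitizers="strlcpy()",
    cwe_id="CWE-120",
    capec_id="CAPEC-100 (Buffer Overflow)",
    impact="memory corruption or code execution",
))  # reproduces the example above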
8. Integration API¶
8.1 Full Example¶
from src.security.hypothesis import (
HypothesisGenerator,
MultiCriteriaScorer,
QuerySynthesizer,
HypothesisExecutor,
HypothesisValidator,
CodebaseStats,
compute_codebase_stats_from_duckdb
)
import duckdb
# 1. Connect to CPG
conn = duckdb.connect("cpg.duckdb")
# 2. Gather codebase statistics
stats = compute_codebase_stats_from_duckdb("cpg.duckdb")
# 3. Generate hypotheses
generator = HypothesisGenerator()
hypotheses = generator.generate(
language="C",
cwe_filter=["CWE-120", "CWE-78", "CWE-89"],
max_hypotheses=100
)
# 4. Multi-criteria scoring
scorer = MultiCriteriaScorer(codebase_stats=stats)
scored = scorer.score_batch(hypotheses)
# 5. Synthesize SQL queries
synthesizer = QuerySynthesizer()
for h in scored:
h.sql_query = synthesizer.synthesize_query(h)
# 6. Execute on CPG
executor = HypothesisExecutor(conn)
for h in scored[:20]: # Top 20
evidence = executor.execute(h)
h.evidence.extend(evidence)
# 7. Validate and report
validator = HypothesisValidator()
results = validator.validate_batch(scored)
print(f"Detection Rate: {results.detection_rate:.1%}")
print(f"Precision: {results.precision:.1%}")
print(f"F1 Score: {results.f1_score:.2f}")
9. Conclusion¶
CodeGraph’s multi-criteria hypothesis validation system takes a fundamentally different approach to vulnerability detection:
- Contextual analysis: scoring reflects the specifics of the particular codebase
- Taint verification: data flow is confirmed on the Code Property Graph
- Risk prioritization: effort is focused on the findings most likely to be exploitable
- False positive reduction: from 70-90% down to less than 30%
Result: 100% detection rate for target CVEs with dramatic reduction in false positives.
Related Documents¶
Version: 1.0 | December 2025