Security Hypothesis Validation

Technical Document for Security Architects and Researchers


Abstract

Traditional SAST (Static Application Security Testing) tools suffer from false positive rates as high as 70-90%, making their findings impractical to triage. CodeGraph addresses this problem with a multi-criteria hypothesis validation system that:

  1. Generates testable hypotheses based on CWE/CAPEC knowledge bases
  2. Evaluates hypotheses across three criteria considering codebase context
  3. Verifies vulnerabilities through taint analysis on Code Property Graph
  4. Achieves 100% CVE detection rate while reducing false positives by 60%+

1. The Problem

1.1 Limitations of Traditional SAST

Traditional SAST:
  Pattern: "strcpy" found
  Result: POSSIBLE vulnerability
  False Positive Rate: 70-90%

CodeGraph:
  Hypothesis: Untrusted data flows from recv() to strcpy()
  Evidence: Taint path verified via CPG
  Result: CONFIRMED vulnerability
  False Positive Rate: <30%

1.2 Why Pattern Matching Is Not Enough

Problem             Description
No context          strcpy is safe if the source is a constant
No data flow        Doesn't consider where data comes from
No sanitization     Ignores validator functions
No prioritization   All findings carry equal weight

2. Solution Architecture

2.1 Hypothesis Validation Pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                    HYPOTHESIS VALIDATION PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ 1. GENERATION                                                   │   │
│  │    HypothesisGenerator.generate()                               │   │
│  │    ├── CWE Database (120+ patterns)                            │   │
│  │    ├── CAPEC Database (50+ attack patterns)                    │   │
│  │    ├── Language Patterns (C, Python, Java)                     │   │
│  │    └── Cartesian Product: CWEs × CAPECs × Patterns             │   │
│  │    Output: SecurityHypothesis[]                                │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ 2. MULTI-CRITERIA SCORING                                       │   │
│  │    MultiCriteriaScorer.score_batch()                           │   │
│  │                                                                  │   │
│  │    Score = CWE_Freq × 0.40 + Attack_Sim × 0.30 + Exposure × 0.30│   │
│  │                                                                  │   │
│  │    Bonuses:                                                     │   │
│  │    ├── Known CVE pattern: ×1.20                                │   │
│  │    ├── Critical severity: ×1.10                                │   │
│  │    └── Recent exploit: ×1.15                                   │   │
│  │                                                                  │   │
│  │    Output: Prioritized hypotheses                               │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ 3. QUERY SYNTHESIS                                              │   │
│  │    QuerySynthesizer.synthesize()                                │   │
│  │    ├── Match hypothesis to SQL template                        │   │
│  │    ├── Parameter substitution                                  │   │
│  │    └── Output: DuckDB SQL / PGQ queries                        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ 4. EXECUTION                                                    │   │
│  │    HypothesisExecutor.execute()                                 │   │
│  │    ├── Run queries against CPG                                 │   │
│  │    ├── Collect evidence                                        │   │
│  │    └── Output: Evidence[]                                      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                              │                                          │
│                              ▼                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │ 5. VALIDATION                                                   │   │
│  │    HypothesisValidator.validate()                               │   │
│  │    ├── Analyze evidence                                        │   │
│  │    ├── Update hypothesis status (CONFIRMED/REJECTED)           │   │
│  │    └── Calculate precision/recall metrics                      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

3. Multi-Criteria Scoring Model

3.1 Prioritization Formula

Priority Score = (CWE_Frequency × 0.40)
               + (Attack_Similarity × 0.30)
               + (Codebase_Exposure × 0.30)
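As a quick worked example, the weighted sum can be sketched as follows (the component values here are illustrative, not measured):

```python
# Weights from the prioritization formula above.
W_CWE, W_ATTACK, W_EXPOSURE = 0.40, 0.30, 0.30

def priority_score(cwe_freq: float, attack_sim: float, exposure: float) -> float:
    """Weighted combination of the three criteria (before bonus multipliers)."""
    return cwe_freq * W_CWE + attack_sim * W_ATTACK + exposure * W_EXPOSURE

# Hypothetical component scores for a buffer-overflow hypothesis:
score = priority_score(cwe_freq=0.86, attack_sim=0.76, exposure=0.45)
print(round(score, 3))  # 0.707
```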

3.2 Scoring Components

CWE Frequency Score (40%)

Evaluates how often this vulnerability appears in real CVEs.

def _score_cwe_frequency(cwe_ids: List[str]) -> float:
    """
    Components:
    - prevalence: Frequency in CVE database (0.0-1.0)
    - exploitability: How easy to exploit (0.0-1.0)
    - cvss_base: CVSS base score / 10 (0.0-1.0)

    Score = prevalence × 0.4 + exploitability × 0.4 + cvss × 0.2
    """
CWE                          Prevalence   Exploitability   CVSS   Score
CWE-120 (Buffer Overflow)    0.85         0.90             8.0    0.86
CWE-78 (Command Injection)   0.75         0.95             9.8    0.88
CWE-89 (SQL Injection)       0.90         0.95             9.8    0.94
CWE-200 (Info Disclosure)    0.60         0.70             5.3    0.63
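A minimal runnable sketch of the docstring formula, using the table rows above (the `CWE_DB` dictionary is a hypothetical stand-in for the real CVE-derived knowledge base):

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the CVE-derived knowledge base:
# cwe_id -> (prevalence, exploitability, cvss_base)
CWE_DB: Dict[str, Tuple[float, float, float]] = {
    "CWE-120": (0.85, 0.90, 8.0),
    "CWE-78":  (0.75, 0.95, 9.8),
    "CWE-89":  (0.90, 0.95, 9.8),
    "CWE-200": (0.60, 0.70, 5.3),
}

def score_cwe_frequency(cwe_ids: List[str]) -> float:
    """Score = prevalence*0.4 + exploitability*0.4 + (cvss/10)*0.2,
    averaged over the hypothesis's CWEs; unknown CWEs score 0."""
    if not cwe_ids:
        return 0.0
    scores = []
    for cwe in cwe_ids:
        prevalence, exploitability, cvss = CWE_DB.get(cwe, (0.0, 0.0, 0.0))
        scores.append(prevalence * 0.4 + exploitability * 0.4 + (cvss / 10) * 0.2)
    return sum(scores) / len(scores)

print(round(score_cwe_frequency(["CWE-89"]), 2))  # 0.94, matching the table
```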

Attack Similarity Score (30%)

Evaluates how well the hypothesis matches known attack patterns from CAPEC.

def _score_attack_similarity(capec_ids: List[str]) -> float:
    """
    Components:
    - likelihood: Attack probability (0.0-1.0)
    - skill_level: Required skill level
      - Low: ×1.0 (higher risk)
      - Medium: ×0.8
      - High: ×0.6
      - Expert: ×0.4 (lower risk)

    Score = likelihood × skill_adjustment
    """
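The skill adjustment can be sketched the same way (the `CAPEC_DB` entries below are illustrative values, not the real CAPEC database):

```python
# Skill-level adjustments from the docstring above (lower skill => broader risk).
SKILL_ADJUSTMENT = {"Low": 1.0, "Medium": 0.8, "High": 0.6, "Expert": 0.4}

# Hypothetical stand-in for the CAPEC knowledge base:
# capec_id -> (likelihood, required skill level)
CAPEC_DB = {
    "CAPEC-100": (0.8, "Low"),     # Overflow Buffers (illustrative values)
    "CAPEC-66":  (0.9, "Medium"),  # SQL Injection (illustrative values)
}

def score_attack_similarity(capec_ids):
    """Score = likelihood x skill adjustment, averaged over the CAPECs."""
    if not capec_ids:
        return 0.0
    scores = []
    for capec in capec_ids:
        likelihood, skill = CAPEC_DB.get(capec, (0.0, "Expert"))
        scores.append(likelihood * SKILL_ADJUSTMENT[skill])
    return sum(scores) / len(scores)

print(score_attack_similarity(["CAPEC-100"]))  # 0.8
```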

Codebase Exposure Score (30%)

Evaluates how exposed the specific codebase is to this vulnerability.

def _score_codebase_exposure(hypothesis) -> float:
    """
    Components:
    - sink_exposure: Presence of dangerous sink functions
    - source_exposure: Presence of external data sources
    - sanitizer_coverage: Presence of sanitizer functions (lowers risk)
    - taint_paths: Number of source → sink paths

    Exposure = (sink × 0.4 + source × 0.4) × (1 - sanitizer × 0.5)
    """
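The exposure formula in the docstring can be checked with a small example (input values are hypothetical, normalized to 0.0-1.0):

```python
def score_codebase_exposure(sink_exposure: float, source_exposure: float,
                            sanitizer_coverage: float) -> float:
    """Exposure = (sink*0.4 + source*0.4) * (1 - sanitizer*0.5)."""
    return (sink_exposure * 0.4 + source_exposure * 0.4) * (1 - sanitizer_coverage * 0.5)

# Hypothetical codebase: many sinks, many sources, half covered by sanitizers.
print(round(score_codebase_exposure(1.0, 0.8, 0.5), 2))  # 0.54
```

Note how sanitizer coverage can at most halve the exposure; even full coverage does not zero it out, since sanitizers may be bypassable.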

3.3 Bonus Multipliers

Bonus               Multiplier   Condition
Known CVE           ×1.20        Pattern matches a known CVE
Critical Severity   ×1.10        CWE has critical severity
Recent Exploit      ×1.15        Recently exploited in the wild
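The multipliers compound, so a fully bonused score can exceed 1.0 (which is why the priority score range is given as 0.0-1.0+ in section 7.1). A minimal sketch:

```python
def apply_bonuses(base_score: float, known_cve: bool = False,
                  critical: bool = False, recent_exploit: bool = False) -> float:
    """Apply the multiplicative bonuses; the result may exceed 1.0."""
    score = base_score
    if known_cve:
        score *= 1.20
    if critical:
        score *= 1.10
    if recent_exploit:
        score *= 1.15
    return score

print(round(apply_bonuses(0.70, known_cve=True), 3))  # 0.84
```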

4. Codebase Statistics

4.1 Statistics Collection from CPG

@dataclass
class CodebaseStats:
    total_methods: int        # Total method count
    total_calls: int          # Total call count

    sink_counts: Dict[str, int]       # sink_name → count
    source_counts: Dict[str, int]     # source_name → count
    sanitizer_counts: Dict[str, int]  # sanitizer_name → count

    taint_paths: int          # Number of source→sink paths

4.2 Tracked Functions

Dangerous Sinks (C):

strcpy, strcat, sprintf, gets, memcpy
system, popen, execl, execv
printf, fprintf (format string)
appendPQExpBuffer, SPI_execute, PQexec (PostgreSQL)

Untrusted Sources:

recv, read, fgets, getenv
PQgetvalue, SPI_getvalue, getTables (PostgreSQL)

Sanitizers:

strlcpy, snprintf
fmtId, quote_identifier, quote_literal
pg_class_aclcheck

5. Taint Analysis on CPG

5.1 Data Flow Verification

-- Find unvalidated paths from source to sink
FROM GRAPH_TABLE(cpg
    MATCH (src:CALL)-[:REACHING_DEF*1..10]->(sink:CALL)
    WHERE src.name IN ('recv', 'getenv', 'PQgetvalue')
      AND sink.name IN ('strcpy', 'sprintf', 'system')
    COLUMNS (
        src.name AS source,
        sink.name AS sink,
        sink.filename,
        sink.line_number
    )
)

5.2 Sanitization Check

-- Check for sanitizers on the path
SELECT h.id, h.source, h.sink,
       EXISTS (
           SELECT 1 FROM nodes_call nc
           WHERE nc.name IN ('strlcpy', 'snprintf', 'quote_identifier')
             AND nc.line_number BETWEEN h.source_line AND h.sink_line
             AND nc.filename = h.filename
       ) AS has_sanitizer
FROM hypothesis_paths h;

6. Validation Results

6.1 Benchmark on PostgreSQL 17

Metric                         Value
CVE Detection Rate             100% (3/3)
Hypothesis Confirmation Rate   55%
Average Query Time             2-3 ms
Generation Time (100 hyp.)     <1 sec
Execution Time (20 hyp.)       <30 sec

6.2 Detected CVEs

CVE ID          Type                    Detection Method
CVE-2025-8713   Statistics Disclosure   Hypothesis + Taint
CVE-2025-8714   pg_dump Injection       Method-based
CVE-2025-8715   Newline Injection       Method-based

6.3 Comparison with Traditional SAST

Tool           True Positives   False Positives   Precision
Pattern SAST   3                45                6.25%
CodeGraph      3                2                 60%
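The precision figures follow directly from TP / (TP + FP):

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

print(f"{precision(3, 45):.2%}")  # 6.25%  (pattern SAST)
print(f"{precision(3, 2):.0%}")   # 60%    (CodeGraph)
```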

7. Hypothesis Structure

7.1 SecurityHypothesis

@dataclass
class SecurityHypothesis:
    id: str                          # Unique identifier
    hypothesis_text: str             # Hypothesis text

    # Classification
    cwe_ids: List[str]              # ["CWE-120", "CWE-119"]
    capec_ids: List[str]            # ["CAPEC-100"]
    language: str                   # "C", "Python"
    category: str                   # "buffer_overflow"

    # Taint patterns
    source_patterns: List[str]      # ["PQgetvalue", "getenv"]
    sink_patterns: List[str]        # ["strcpy", "memcpy"]
    sanitizer_patterns: List[str]   # ["strlcpy", "sizeof"]

    # Scoring
    priority_score: float           # 0.0-1.0+
    confidence: float               # 0.0-1.0

    # Multi-criteria breakdown
    cwe_frequency_score: float
    attack_similarity_score: float
    codebase_exposure_score: float

    # Validation
    sql_query: Optional[str]
    evidence: List[Evidence]
    validation_status: ValidationStatus

7.2 Hypothesis Format

"If untrusted data from {sources} flows to {sinks}
 without sanitization via {sanitizers},
 then {cwe_id} enables {capec_id} attack,
 potentially allowing {impact}."
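The placeholders map directly onto `str.format` fields. A minimal rendering sketch (the template constant name is hypothetical):

```python
# Hypothetical template constant mirroring the hypothesis format above.
HYPOTHESIS_TEMPLATE = (
    "If untrusted data from {sources} flows to {sinks}\n"
    " without sanitization via {sanitizers},\n"
    " then {cwe_id} enables {capec_id} attack,\n"
    " potentially allowing {impact}."
)

text = HYPOTHESIS_TEMPLATE.format(
    sources="PQgetvalue()",
    sinks="strcpy()",
    sanitizers="strlcpy()",
    cwe_id="CWE-120",
    capec_id="CAPEC-100 (Buffer Overflow)",
    impact="memory corruption or code execution",
)
print(text)
```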

Example:

"If untrusted data from PQgetvalue() flows to strcpy()
 without bounds checking via strlcpy(),
 then CWE-120 enables CAPEC-100 (Buffer Overflow) attack,
 potentially allowing memory corruption or code execution."

8. Integration API

8.1 Full Example

from src.security.hypothesis import (
    HypothesisGenerator,
    MultiCriteriaScorer,
    QuerySynthesizer,
    HypothesisExecutor,
    HypothesisValidator,
    CodebaseStats,
    compute_codebase_stats_from_duckdb
)
import duckdb

# 1. Connect to CPG
conn = duckdb.connect("cpg.duckdb")

# 2. Gather codebase statistics
stats = compute_codebase_stats_from_duckdb("cpg.duckdb")

# 3. Generate hypotheses
generator = HypothesisGenerator()
hypotheses = generator.generate(
    language="C",
    cwe_filter=["CWE-120", "CWE-78", "CWE-89"],
    max_hypotheses=100
)

# 4. Multi-criteria scoring
scorer = MultiCriteriaScorer(codebase_stats=stats)
scored = scorer.score_batch(hypotheses)

# 5. Synthesize SQL queries
synthesizer = QuerySynthesizer()
for h in scored:
    h.sql_query = synthesizer.synthesize(h)

# 6. Execute on CPG
executor = HypothesisExecutor(conn)
for h in scored[:20]:  # Top 20
    evidence = executor.execute(h)
    h.evidence.extend(evidence)

# 7. Validate and report
validator = HypothesisValidator()
results = validator.validate_batch(scored)

print(f"Detection Rate: {results.detection_rate:.1%}")
print(f"Precision: {results.precision:.1%}")
print(f"F1 Score: {results.f1_score:.2f}")

9. Conclusion

CodeGraph’s multi-criteria hypothesis validation system represents a fundamentally new approach to vulnerability detection:

  1. Contextual analysis — considers specifics of the particular codebase
  2. Taint verification — confirms data flow through CPG
  3. Risk prioritization — focus on actually exploitable vulnerabilities
  4. False positive reduction — from 70-90% down to less than 30%

Result: 100% detection rate for target CVEs with dramatic reduction in false positives.



Version: 1.0 | December 2025