Version: 3.0 Language: English
Table of Contents¶
- Overview
- Key Features
- Architecture
- Getting Started
- Taint Analysis
- Memory Safety Analysis
- NULL Pointer Detection
- Information Disclosure Detection
- Race Condition Detection
- Advanced Features
- Configuration
- Troubleshooting
- Performance Benchmarks
- API Reference
- Additional Resources
Overview¶
CodeGraph’s dataflow analysis engine provides precise vulnerability detection through:
- Taint Analysis - Track untrusted data from sources to sinks (94-96% accuracy)
- Memory Safety - Detect use-after-free and double-free bugs (85% with alias analysis)
- NULL Pointer Detection - Find potential NULL dereferences (85-90% accuracy)
- Information Disclosure - Detect sensitive data leaks (80-85% accuracy)
- Race Condition Detection - Find TOCTOU and data races (80-85% accuracy)
All numeric defaults (max_paths, max_depth, thresholds) are loaded from config.yaml via get_unified_config().
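The "None means use the configured default" pattern described above can be sketched as follows. This is a minimal illustration, not CodeGraph's implementation: `CONFIG` stands in for the parsed config.yaml, `resolve_option` is a hypothetical helper, and the numeric values are placeholders rather than the real defaults.

```python
# Stand-in for the parsed config.yaml; keys and values are illustrative placeholders.
CONFIG = {"taint_analysis": {"max_paths": 100, "max_depth": 20, "min_risk_score": 0.5}}


def resolve_option(section, name, override=None):
    """Return the caller's override, or the configured default when override is None."""
    if override is not None:
        return override
    return CONFIG[section][name]


# No override: the value comes from the config mapping.
default_depth = resolve_option("taint_analysis", "max_depth")
# Explicit override: the caller's value wins, including falsy values such as 0.
custom_depth = resolve_option("taint_analysis", "max_depth", override=10)
```

Testing against `None` rather than truthiness keeps an explicit `0` or `0.0` from being silently replaced by the default.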
Key Features¶
Core Analysis Phases¶
- Inter-Procedural Analysis - Tracks taint across function calls (Phase 4.2)
- Control-Flow Awareness - Analyzes conditional execution (Phase 4.3)
- Field-Sensitive Analysis - Tracks struct field taint (Phase 4.4)
- Context-Sensitive Analysis - Distinguishes calls from different callers (Phase 5)
- Sanitization Detection - Identifies proper input sanitization
Advanced Enhancements¶
- Cross-Language Support - 11 languages via GoCPG
- Pointer Alias Analysis - Improved UAF detection (+15% accuracy)
- Symbolic Execution V2 - Path feasibility checking via Z3 (reduces false positives by 10-15%)
Architecture¶
Core Components¶
TaintPropagator (Main Engine)
+-- DataflowTracker - Forward dataflow propagation
+-- InterProcTracker - Inter-procedural tracking
+-- ControlFlowAnalyzer - Control dependency analysis
+-- FieldSensitiveTracker - Struct field tracking
+-- ContextSensitiveTracker - Call context tracking (Phase 5)
+-- PathConstraintTracker - Symbolic execution via Z3
MemoryLifetimeAnalyzer (Memory Safety)
+-- PointerAliasAnalyzer - Andersen's alias analysis
Analysis Flow¶
1. Load domain-specific sources/sinks from plugin YAML
2. Find source nodes (user input, env, network, etc.)
3. Find sink nodes (SQL, shell, file ops, etc.)
4. Propagate taint through dataflow edges
5. Track call contexts for precision (if enabled)
6. Build pointer alias graph (if enabled, memory analysis only)
7. Filter infeasible paths with Z3 (if enabled)
8. Calculate risk scores and confidence
9. Return sorted vulnerability paths
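Steps 2 to 4 and 9 of the flow above can be sketched on a toy graph. This is a simplified illustration under stated assumptions, not the engine's code: the graph, the node names, the risk value, and the `find_taint_paths` function here are all made up, and a plain breadth-first search stands in for the real propagation.

```python
from collections import deque

# Toy dataflow graph: node -> successors. All node names are made up.
EDGES = {
    "read_input": ["parse"],
    "parse": ["build_query", "log_debug"],
    "build_query": ["db_execute"],
    "log_debug": [],
    "db_execute": [],
}
SOURCES = {"read_input"}
SINKS = {"db_execute": 0.9}  # sink -> illustrative base risk


def find_taint_paths(edges, sources, sinks, max_depth=10):
    """Breadth-first search from each source; record every path that reaches a sink."""
    found = []
    for source in sources:
        queue = deque([[source]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                found.append({"path": path, "risk": sinks[node]})
                continue
            if len(path) > max_depth:
                continue  # do not extend paths that already exceed the depth limit
            for nxt in edges.get(node, []):
                if nxt not in path:  # skip cycles
                    queue.append(path + [nxt])
    # Highest risk first, shorter paths first on ties.
    return sorted(found, key=lambda p: (-p["risk"], len(p["path"])))


paths = find_taint_paths(EDGES, SOURCES, SINKS)
```

Lowering `max_depth` trades recall for speed: with `max_depth=2` the only sink in this graph is out of reach and the result is empty.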
Getting Started¶
Prerequisites¶
Required:
pip install -r requirements.txt
Optional (for symbolic execution):
pip install z3-solver
Basic Usage¶
from src.analysis.dataflow import analyze_sql_injections
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    # Analyze SQL injection vulnerabilities
    # All defaults come from config.yaml -> taint_analysis.*
    paths = analyze_sql_injections(cpg)
    # Display results
    for path in paths:
        print(f"SQL Injection: {path.source_function} -> {path.sink_function}")
        print(f" Risk: {path.risk_level} ({path.risk_score:.2f})")
        print(f" Confidence: {path.confidence:.2f}")
        print(f" Location: {path.source_location} -> {path.sink_location}")
Taint Analysis¶
SQL Injection Detection¶
from src.analysis.dataflow import analyze_sql_injections
# cpg is an open CPGQueryService, as in Basic Usage
paths = analyze_sql_injections(cpg)
for path in paths:
    # Source: where untrusted data comes from
    print(f"Source: {path.source_function} at {path.source_location}")
    # Sink: where it's used dangerously
    print(f"Sink: {path.sink_function} at {path.sink_location}")
    # Sanitization: was it properly sanitized?
    if path.sanitizers:
        print(f"Sanitizers: {', '.join(path.sanitizers)}")
        print(f"Sanitization score: {path.sanitization_score:.2f}")
    else:
        print("No sanitization detected!")
    # Path details
    print(f"Path length: {path.path_length} hops")
    if path.inter_procedural:
        print(f"Functions crossed: {', '.join(path.functions_crossed)}")
Command Injection Detection¶
from src.analysis.dataflow import analyze_command_injections
# Detects taint flows to OS command sinks (system, popen, exec, subprocess.run)
# Sinks defined per-domain in annotations.yaml -> sink_type: command_injection
paths = analyze_command_injections(cpg)
for path in paths:
    print(f"Command Injection (CWE-78): {path.source_function} -> {path.sink_function}")
    print(f" Risk: {path.risk_level} ({path.risk_score:.2f})")
    print(f" Recommendation: {path.recommendation}")
Path Traversal Detection¶
from src.analysis.dataflow import analyze_path_traversal
# Detects taint flows to filesystem sinks (fopen, open, access, stat, os.path.join)
# Sinks defined per-domain in annotations.yaml -> sink_type: path_traversal
paths = analyze_path_traversal(cpg)
for path in paths:
    print(f"Path Traversal (CWE-22): {path.source_function} -> {path.sink_function}")
    print(f" Risk: {path.risk_level} ({path.risk_score:.2f})")
    print(f" Recommendation: {path.recommendation}")
Custom Taint Analysis¶
from src.analysis.dataflow.taint_analysis import TaintPropagator
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    # Create propagator with custom settings
    propagator = TaintPropagator(
        cpg,
        enable_inter_proc=True,          # Inter-procedural tracking (Phase 4.2)
        enable_control_flow=True,        # Control-flow analysis (Phase 4.3)
        enable_field_sensitive=True,     # Field-sensitive tracking (Phase 4.4)
        enable_context_sensitive=True,   # Call context tracking (Phase 5)
        enable_symbolic_execution=True   # Path feasibility checking via Z3
    )
    # Find taint paths (defaults from config.yaml -> taint_analysis.*)
    paths = propagator.find_taint_paths(
        source_category='sql',
        timeout_seconds=60
    )
Memory Safety Analysis¶
Use-After-Free Detection¶
from src.analysis.dataflow.memory_lifetime import MemoryLifetimeAnalyzer
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    analyzer = MemoryLifetimeAnalyzer(
        cpg,
        enable_alias_analysis=True  # Track aliased pointers via Andersen's algorithm
    )
    paths = analyzer.analyze_use_after_free(
        max_paths=100,
        max_hops=15,
        min_confidence=0.6
    )
    for path in paths:
        print("Use-After-Free:")
        print(f" Allocation: {path.allocation_function} at {path.allocation_location}")
        print(f" Free: {path.free_function} at {path.free_location}")
        print(f" Use: {path.use_type} at {path.use_location}")
        print(f" Pointer: {path.pointer_name}")
        print(f" Risk: {path.risk_score:.2f}")
        print(f" Confidence: {path.confidence:.2f}")
Double-Free Detection¶
from src.analysis.dataflow.memory_lifetime import MemoryLifetimeAnalyzer
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    analyzer = MemoryLifetimeAnalyzer(cpg)
    paths = analyzer.analyze_double_free(
        max_paths=50,
        min_confidence=0.6
    )
    for path in paths:
        print("Double-Free:")
        print(f" Allocation: {path.allocation_function} at {path.allocation_location}")
        print(f" First free: {path.first_free_location}")
        print(f" Second free: {path.second_free_location}")
        print(f" Risk: {path.risk_score:.2f}")
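To make the two bug classes concrete, here is a minimal sketch that scans a linear trace of pointer events. It is an illustration only and makes strong assumptions: the trace format and the `find_memory_bugs` function are hypothetical, there is no aliasing and no branching, and MemoryLifetimeAnalyzer itself works on the CPG rather than on a trace like this.

```python
def find_memory_bugs(events):
    """Scan a linear trace of (op, pointer) events for use-after-free and double-free."""
    freed = set()
    bugs = []
    for index, (op, ptr) in enumerate(events):
        if op == "alloc":
            freed.discard(ptr)  # a fresh allocation makes the name valid again
        elif op == "free":
            if ptr in freed:
                bugs.append(("double_free", ptr, index))
            freed.add(ptr)
        elif op == "use" and ptr in freed:
            bugs.append(("use_after_free", ptr, index))
    return bugs


# Hypothetical trace: the buffer is used and freed again after its first free.
trace = [("alloc", "buf"), ("free", "buf"), ("use", "buf"), ("free", "buf")]
bugs = find_memory_bugs(trace)
```

Because the sketch keys on pointer names, it misses a use through a second name for the same allocation, which is the gap that alias analysis is meant to close.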
Pointer Alias Analysis¶
from src.analysis.dataflow.pointer_alias import PointerAliasAnalyzer
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    analyzer = PointerAliasAnalyzer(cpg)
    # Analyze function for pointer aliases
    points_to = analyzer.analyze(function_id=123)
    # Check if two pointers may alias
    if analyzer.may_alias(ptr1_id=456, ptr2_id=457):
        print("Pointers may alias!")
    # Get all aliases
    aliases = analyzer.get_aliases(ptr1_id=456)
    print(f"Aliases: {aliases}")
    # Get statistics
    stats = analyzer.get_statistics()
    print(f"Analyzed {stats['total_pointers']} pointers")
    print(f"Found {stats['total_allocations']} allocations")
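For readers unfamiliar with Andersen's analysis, the idea is to turn pointer assignments into set-inclusion constraints and iterate until the points-to sets stop growing. The sketch below is a textbook-style toy under stated assumptions: the constraint tuples, variable names, and the `andersen` and `may_alias` functions are hypothetical and unrelated to PointerAliasAnalyzer's internals, and the naive fixed-point loop ignores the worklist optimizations a real implementation would use.

```python
from collections import defaultdict


def andersen(constraints):
    """Solve inclusion constraints by iterating to a fixed point.

    Each constraint is (kind, lhs, rhs):
      ("addr",  p, a)  p = &a
      ("copy",  p, q)  p = q
      ("load",  p, q)  p = *q
      ("store", p, q)  *p = q
    """
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for kind, lhs, rhs in constraints:
            if kind == "addr":
                targets, new = [lhs], {rhs}
            elif kind == "copy":
                targets, new = [lhs], set(pts[rhs])
            elif kind == "load":
                targets, new = [lhs], set()
                for obj in list(pts[rhs]):
                    new |= pts[obj]
            else:  # store: every object lhs may point to receives pts(rhs)
                targets, new = list(pts[lhs]), set(pts[rhs])
            for target in targets:
                if not new <= pts[target]:
                    pts[target] |= new
                    changed = True
    return pts


def may_alias(pts, x, y):
    """Two pointers may alias when their points-to sets overlap."""
    return bool(pts[x] & pts[y])


# p = &a; q = p; r = &b; s = &p; *s = r; t = &c
pts = andersen([
    ("addr", "p", "a"), ("copy", "q", "p"), ("addr", "r", "b"),
    ("addr", "s", "p"), ("store", "s", "r"), ("addr", "t", "c"),
])
```

The store through `s` makes `p`, and therefore `q`, possibly point to `b`, so `q` and `r` may alias even though no statement assigns one to the other directly.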
NULL Pointer Detection¶
from src.analysis.dataflow import analyze_null_dereferences
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    paths = analyze_null_dereferences(
        cpg,
        max_paths=100,
        max_hops=10,
        min_confidence=0.7
    )
    for path in paths:
        print("NULL Dereference:")
        print(f" Source: {path.source_function}() at {path.source_location}")
        print(f" Dereference: {path.dereference_type} at {path.dereference_location}")
        print(f" Has NULL check: {path.has_null_check}")
        print(f" Risk: {path.risk_score:.2f}")
        print(f" Variable: {path.variable_name}")
Information Disclosure Detection¶
from src.analysis.dataflow import analyze_info_disclosure
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    paths = analyze_info_disclosure(
        cpg,
        max_paths=100,
        max_hops=10,
        min_risk_score=0.6
    )
    for path in paths:
        print("Information Disclosure:")
        print(f" Data category: {path.data_category}")  # credentials, pii, secrets
        print(f" Source: {path.source_function} at {path.source_location}")
        print(f" Sink: {path.sink_function} at {path.sink_location}")
        print(f" Disclosure type: {path.disclosure_type}")  # error_message, debug_log
        print(f" Severity: {path.severity}")
        print(f" Risk: {path.risk_score:.2f}")
Race Condition Detection¶
from src.analysis.dataflow import analyze_race_conditions
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    paths = analyze_race_conditions(
        cpg,
        max_paths=100,
        max_hops=15,
        min_confidence=0.7
    )
    for path in paths:
        print("Race Condition:")
        print(f" Type: {path.race_type}")  # toctou, data_race, missing_lock
        print(f" Check: {path.check_function} at {path.check_location}")
        print(f" Use: {path.use_function} at {path.use_location}")
        print(f" Resource: {path.shared_resource}")
        print(f" Has lock: {path.has_lock}")
        print(f" Risk: {path.risk_score:.2f}")
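The check/use/has-lock fields above describe the classic time-of-check-to-time-of-use pattern. The following sketch shows one simple way to pair a check with a later use on the same resource. It is a toy under stated assumptions: the call lists, the trace format, and `find_toctou` are hypothetical, a single global lock is modelled, and the real detector works on the CPG rather than on a flat call sequence.

```python
CHECK_CALLS = {"access", "stat"}   # illustrative "check" functions
USE_CALLS = {"open", "fopen"}      # illustrative "use" functions


def find_toctou(calls):
    """Pair each check with the next use of the same resource and report lock coverage."""
    held = False
    epoch = 0     # bumped on every lock/unlock; equal epochs mean the same critical section
    checks = {}   # resource -> (index, lock held at check, epoch at check)
    findings = []
    for index, (func, resource) in enumerate(calls):
        if func in ("lock", "unlock"):
            held = func == "lock"
            epoch += 1
        elif func in CHECK_CALLS:
            checks[resource] = (index, held, epoch)
        elif func in USE_CALLS and resource in checks:
            check_index, check_held, check_epoch = checks.pop(resource)
            has_lock = held and check_held and epoch == check_epoch
            findings.append({"resource": resource, "check": check_index,
                             "use": index, "has_lock": has_lock})
    return findings


calls = [
    ("access", "/tmp/a"), ("open", "/tmp/a"),               # unprotected check-then-use
    ("lock", None), ("stat", "/tmp/b"), ("open", "/tmp/b"), ("unlock", None),
]
findings = find_toctou(calls)
```

A pair with `has_lock` set to False is the candidate worth reporting; a pair covered by one critical section would typically be ranked lower.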
Advanced Features¶
Symbolic Execution¶
from src.analysis.dataflow import analyze_sql_injections
from src.analysis.dataflow.symbolic_execution import PathConstraintTracker
from src.services.cpg import CPGQueryService
with CPGQueryService() as cpg:
    tracker = PathConstraintTracker(cpg, max_constraints=20)
    # Check if specific path is feasible
    is_feasible, confidence = tracker.is_path_feasible(
        start_node=100,
        end_node=200
    )
    if not is_feasible and confidence > 0.9:
        print("Path is definitely infeasible (false positive)")
    # Filter infeasible paths from list
    all_paths = analyze_sql_injections(cpg)
    feasible_paths = tracker.filter_infeasible_paths(
        all_paths,
        min_confidence=0.8
    )
    print(f"Filtered {len(all_paths) - len(feasible_paths)} infeasible paths")
TaintPropagator automatically loads SymbolicExecutionConfig from config.yaml -> symbolic_execution.* with settings: solver_timeout_ms (500), solver_timeout_uf_ms (2000), max_parse_depth (10), enable_function_models (True), enable_arithmetic (True).
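Path feasibility checking asks whether the branch conditions collected along a path can all hold at once. The sketch below illustrates the idea without Z3: it substitutes plain integer interval intersection for the SMT solver, so it only handles conjunctions of single-variable comparisons against constants. The constraint format and this `is_path_feasible` function are hypothetical and are not PathConstraintTracker's API.

```python
import math


def is_path_feasible(constraints):
    """Intersect integer bounds per variable; an empty interval means the path is infeasible.

    Each constraint is (variable, op, constant) with op in {"<", "<=", ">", ">=", "=="}.
    """
    bounds = {}
    for var, op, const in constraints:
        low, high = bounds.get(var, (-math.inf, math.inf))
        if op == "<":
            high = min(high, const - 1)
        elif op == "<=":
            high = min(high, const)
        elif op == ">":
            low = max(low, const + 1)
        elif op == ">=":
            low = max(low, const)
        elif op == "==":
            low, high = max(low, const), min(high, const)
        else:
            raise ValueError(f"unsupported operator: {op}")
        if low > high:
            return False  # no integer satisfies every condition seen so far
        bounds[var] = (low, high)
    return True


# A path guarded by `n > 10` and later `n < 5` can never execute.
contradictory = is_path_feasible([("n", ">", 10), ("n", "<", 5)])
# A path guarded by `n > 10` and `n <= 20` can.
satisfiable = is_path_feasible([("n", ">", 10), ("n", "<=", 20)])
```

A taint path whose guards are contradictory in this sense is a false positive and can be dropped; an SMT solver generalizes the same test to arithmetic across several variables.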
Cross-Language Analysis¶
from src.analysis.dataflow import analyze_info_disclosure
from src.domains import DomainRegistry
# Activate domain for your language
DomainRegistry.activate("python_django")  # or: javascript, go, postgresql
# Analysis automatically adapts to domain (cpg is an open CPGQueryService)
paths = analyze_info_disclosure(cpg)
# Uses Python-specific sources/sinks (os.getenv, logger.error, etc.)
Field-Sensitive Analysis¶
paths = propagator.find_taint_paths(source_category='sql')
for path in paths:
    if path.field_sensitive:
        print(f"Tainted fields: {', '.join(path.tainted_fields)}")
        # Example: ['user.email', 'user.password']
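Field sensitivity means taint is tracked per struct field instead of per object, so a clean field of a partly tainted object is not reported. A minimal sketch of that bookkeeping, assuming a hypothetical list of `(target, value)` assignments and a hypothetical `tainted_fields` helper unrelated to FieldSensitiveTracker's internals:

```python
def tainted_fields(assignments, taint_sources):
    """Track taint per 'object.field' key rather than per whole object."""
    tainted = set()
    for target, value in assignments:
        if value in taint_sources or value in tainted:
            tainted.add(target)
        else:
            tainted.discard(target)  # overwritten with clean data
    return tainted


assignments = [
    ("user.email", "request.form"),   # tainted by user input
    ("user.id", "42"),                # constant, clean
    ("audit.contact", "user.email"),  # copying a tainted field propagates taint
]
fields = tainted_fields(assignments, taint_sources={"request.form"})
```

A field-insensitive analysis would mark all of `user` as tainted after the first assignment and would then flag any sink that receives `user.id`.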
Control-Flow Analysis¶
for path in paths:
    if path.is_conditional:
        print(f"Control dependencies: {len(path.control_dependencies)}")
        print(f"Execution probability: {path.execution_probability:.2f}")
        for dep in path.control_dependencies:
            print(f" Depends on: {dep.control_code} at {dep.control_location}")
Configuration¶
Domain Selection¶
Edit config.yaml:
domain:
  name: python_django  # or: javascript, go, postgresql
Analyzer Options¶
# Disable specific features for performance
propagator = TaintPropagator(
    cpg,
    enable_inter_proc=False,          # Disable inter-procedural (faster)
    enable_control_flow=False,        # Disable control-flow (faster)
    enable_field_sensitive=False,     # Disable field tracking (faster)
    enable_context_sensitive=False,   # Disable context tracking (faster)
    enable_symbolic_execution=False   # Disable path feasibility (faster)
)
# Adjust analysis depth (defaults from config.yaml -> taint_analysis.*)
paths = propagator.find_taint_paths(
    max_paths=50,   # Fewer paths (faster)
    max_depth=10    # Shallower search (faster)
)
Baseline Comparison¶
# Compare accuracy with and without enhancements
baseline_analyzer = MemoryLifetimeAnalyzer(cpg, enable_alias_analysis=False)
enhanced_analyzer = MemoryLifetimeAnalyzer(cpg, enable_alias_analysis=True)
baseline_paths = baseline_analyzer.analyze_use_after_free()
enhanced_paths = enhanced_analyzer.analyze_use_after_free()
print(f"Baseline: {len(baseline_paths)} bugs found")
print(f"Enhanced: {len(enhanced_paths)} bugs found")
Troubleshooting¶
Z3 Solver Not Available¶
Error: Symbolic execution disabled: No module named 'z3'
Solution:
pip install z3-solver
If you don’t need symbolic execution:
propagator = TaintPropagator(cpg, enable_symbolic_execution=False)
High Memory Usage¶
Symptom: Analysis crashes with OOM error
Solutions:
# 1. Reduce max paths
paths = analyzer.analyze_use_after_free(max_paths=50)
# 2. Reduce max depth
paths = analyzer.analyze_use_after_free(max_hops=10)
# 3. Disable expensive features
analyzer = MemoryLifetimeAnalyzer(cpg, enable_alias_analysis=False)
Slow Analysis¶
Symptom: Analysis takes >10 seconds
Solutions:
# 1. Disable symbolic execution (it adds about 20% to run time; see Performance Benchmarks)
propagator = TaintPropagator(cpg, enable_symbolic_execution=False)
# 2. Disable pointer alias analysis (it adds about 50% to UAF run time; see Performance Benchmarks)
analyzer = MemoryLifetimeAnalyzer(cpg, enable_alias_analysis=False)
# 3. Reduce analysis depth
paths = propagator.find_taint_paths(max_depth=10)
No Paths Found¶
Symptom: [] returned from analysis
Possible causes:
1. Wrong domain selected (check config.yaml)
2. No sources/sinks in codebase
3. min_risk_score threshold too high
Solution:
# Lower risk threshold (default from config.yaml)
paths = analyze_sql_injections(cpg, min_risk_score=0.1)
# Check domain
from src.domains import get_active_domain
print(f"Active domain: {get_active_domain().name}")
# Verify sources/sinks via public API
sources = propagator.find_source_nodes()
sinks = propagator.find_sink_nodes()
print(f"Sources: {len(sources)}, Sinks: {len(sinks)}")
Performance Benchmarks¶
Environment: 10K methods CPG, Intel i7, 16GB RAM
| Analysis Type | Time | Memory | Paths Found |
|---|---|---|---|
| SQL Injection | 2-4s | 80-100MB | 50-75 |
| Command Injection | 2-4s | 80-100MB | 30-60 |
| Path Traversal | 2-4s | 80-100MB | 20-50 |
| UAF (baseline) | 2-4s | 80-100MB | 50-75 |
| UAF (with alias) | 3-6s | 120-150MB | 85-100 |
| NULL Dereference | 2-3s | 50-80MB | 40-60 |
| Info Disclosure | 3-5s | 80-120MB | 60-90 |
| Race Conditions | 2-4s | 60-100MB | 30-50 |
With Symbolic Execution: +20% time, +25% memory
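As a worked example of the overhead figures, the snippet below scales one table row. It assumes the +20% and +25% apply uniformly to both ends of a range, which the benchmark note does not state explicitly; `with_symbolic_execution` is a hypothetical helper for the arithmetic only.

```python
def with_symbolic_execution(time_range_s, memory_range_mb):
    """Apply the stated +20% time and +25% memory overheads to a (low, high) range."""
    time_scaled = tuple(round(t * 1.20, 2) for t in time_range_s)
    memory_scaled = tuple(round(m * 1.25, 2) for m in memory_range_mb)
    return time_scaled, memory_scaled


# SQL Injection row: 2-4s and 80-100MB without symbolic execution.
sql_time, sql_memory = with_symbolic_execution((2, 4), (80, 100))
```

Under that assumption the SQL Injection row becomes roughly 2.4-4.8s and 100-125MB.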
API Reference¶
Quick Reference¶
# Taint analysis (defaults from config.yaml -> taint_analysis.*)
analyze_sql_injections(cpg, max_paths=None, min_risk_score=None, max_depth=None)
analyze_command_injections(cpg, max_paths=None, min_risk_score=None, max_depth=None)
analyze_path_traversal(cpg, max_paths=None, min_risk_score=None, max_depth=None)
# Memory safety
analyze_use_after_free(cpg, max_paths=100, max_hops=None, min_confidence=0.6)
analyze_double_free(cpg, max_paths=100, max_hops=None, min_confidence=0.6)
# NULL pointer
analyze_null_dereferences(cpg, max_paths=100, max_hops=None, min_confidence=0.7)
# Information disclosure
analyze_info_disclosure(cpg, max_paths=100, max_hops=None, min_risk_score=0.6)
# Race conditions
analyze_race_conditions(cpg, max_paths=100, max_hops=None, min_confidence=0.7)
For detailed API documentation, see:
- src/analysis/dataflow/__init__.py
- src/analysis/dataflow/taint_analysis.py
- src/analysis/dataflow/memory_lifetime.py
- src/analysis/dataflow/pointer_alias.py
- src/analysis/dataflow/symbolic_execution.py
Taint Visualization¶
Taint analysis results can be rendered as Mermaid flowcharts via src/security/taint_visualizer.py, providing visual source-to-sink data flow diagrams. SARIF 2.1.0 export (--sarif-file) includes full codeFlows with step-by-step taint propagation.
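To show what such a diagram looks like, here is a small sketch that turns an ordered list of path steps into Mermaid flowchart text. It is not the API of src/security/taint_visualizer.py: the `render_mermaid` function and the step labels are hypothetical, and only the Mermaid syntax itself (`flowchart LR`, bracketed node labels, `-->` edges) is standard.

```python
def render_mermaid(steps):
    """Render an ordered list of taint-path steps as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR"]
    for index, label in enumerate(steps):
        safe = label.replace('"', "'")  # a double quote would end the Mermaid label early
        lines.append(f'    n{index}["{safe}"]')
    for index in range(len(steps) - 1):
        lines.append(f"    n{index} --> n{index + 1}")
    return "\n".join(lines)


diagram = render_mermaid(["request.args (source)", "build_query", "cursor.execute (sink)"])
print(diagram)
```

The output can be pasted into any Mermaid renderer; generated node ids (`n0`, `n1`, ...) keep arbitrary labels from colliding with Mermaid keywords.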
Cross-Language Support¶
GoCPG supports 11 languages: C, C++, Go, Python, JavaScript, TypeScript, Java, Kotlin, C#, PHP, and 1C:Enterprise. Cross-language FFI edges (CGO, ctypes, cffi) connect taint paths across language boundaries.
Note: Always use ProjectManager.get_active_db_path() to get the active project’s DB path; see src/project_manager.py.
Additional Resources¶
- CLI Guide: docs/guides/en/CLI_GUIDE.md
- SQL Query Cookbook: docs/guides/en/SQL_QUERY_COOKBOOK.md
- Troubleshooting: docs/guides/en/TROUBLESHOOTING.md
Questions? See docs/guides/en/TROUBLESHOOTING.md or contact the development team.
Last updated: March 2026