Clone Detection¶
AST-based clone detection engine that identifies duplicate and near-duplicate code fragments. It detects four clone types, from exact duplicates to semantic clones, using token similarity, AST structure comparison, and control flow analysis.
Overview¶
Clone detection finds copy-pasted or structurally similar code across the codebase. This helps:
- Security: find copy-pasted vulnerable code (e.g., unsafe strcpy patterns)
- Quality: reduce code duplication for easier maintenance
- Refactoring: identify candidates for extracting shared functions
Clone Types¶
| Type | Name | Detection Method | Default Threshold |
|---|---|---|---|
| Type-1 | Exact | Token Jaccard similarity | 0.95 |
| Type-2 | Renamed | Normalized token similarity (variables/types replaced with placeholders) | 0.80 |
| Type-3 | Structural | AST structure sequence similarity | 0.65 |
| Type-4 | Semantic | Control flow sequence similarity | 0.60 |
Detection proceeds in the order listed, and the highest matching type is returned for each pair.
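To illustrate why a Type-2 check can match code that a Type-1 check misses, here is a minimal sketch of placeholder normalization. The helper names (`tokenize`, `normalize`), the regex tokenizer, and the tiny keyword and type vocabularies are all assumptions made for illustration; they are not taken from `clone_detector.py`.

```python
import re
from typing import List

# Hypothetical, deliberately tiny vocabularies for C-like code.
KEYWORDS = {"if", "else", "for", "while", "return"}
TYPE_NAMES = {"int", "char", "void", "float", "double", "long"}

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(code: str) -> List[str]:
    """Split source text into identifiers, integer literals, and single symbols."""
    return TOKEN_RE.findall(code)


def normalize(tokens: List[str]) -> List[str]:
    """Replace type names and other identifiers with placeholders."""
    normalized = []
    for tok in tokens:
        if tok in TYPE_NAMES:
            normalized.append("TYPE")
        elif tok in KEYWORDS:
            normalized.append(tok)   # keywords carry structure, keep them
        elif IDENT_RE.fullmatch(tok):
            normalized.append("ID")
        else:
            normalized.append(tok)   # literals and punctuation stay as-is
    return normalized


a = tokenize("int total = count + 1;")
b = tokenize("long sum = n + 1;")
print(a == b)                        # False: raw tokens differ
print(normalize(a) == normalize(b))  # True: both are TYPE ID = ID + 1 ;
```

The two statements differ only in names and types, so they diverge at the raw token level but become identical once normalized.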
API Reference¶
CloneResult¶
@dataclass
class CloneResult:
    method1_id: int             # First method CPG node ID
    method1_name: str           # First method name
    method1_file: str           # First method file:line
    method2_id: int             # Second method CPG node ID
    method2_name: str           # Second method name
    method2_file: str           # Second method file:line
    similarity: float           # 0.0-1.0 similarity score
    clone_type: str             # 'exact', 'renamed', 'structural', 'semantic'
    shared_patterns: List[str]  # e.g., ['error_handling', 'null_check']
    line_count1: int            # Lines in first method
    line_count2: int            # Lines in second method
ASTCloneDetector¶
class ASTCloneDetector:
    def __init__(self, cpg_service):
        """Initialize with CPGQueryService for database access."""

    def detect_clones(
        self,
        min_similarity: Optional[float] = None,  # Default: 0.7
        category: Optional[str] = None,          # Filter by pattern category
        max_methods: Optional[int] = None,       # Default: 200
        min_lines: Optional[int] = None,         # Default: 5
    ) -> List[CloneResult]:
        """
        Detect clones across the entire codebase.

        Returns:
            List of CloneResult sorted by similarity descending
        """

    def detect_clones_for_category(
        self,
        category: str,
        min_similarity: Optional[float] = None,  # Default: 0.6
    ) -> List[CloneResult]:
        """
        Detect clones within a specific code pattern category.

        Args:
            category: Pattern category (e.g., 'error_handling', 'null_check')
        """
Usage¶
Basic Detection¶
from src.analysis.clone_detector import ASTCloneDetector

# cpg_service is an initialized CPGQueryService instance
detector = ASTCloneDetector(cpg_service)
clones = detector.detect_clones(min_similarity=0.7)
for clone in clones:
    print(f"[{clone.clone_type}] {clone.similarity:.0%} similarity")
    print(f"  {clone.method1_name} ({clone.method1_file}, {clone.line_count1} lines)")
    print(f"  {clone.method2_name} ({clone.method2_file}, {clone.line_count2} lines)")
    print(f"  Shared patterns: {', '.join(clone.shared_patterns)}")
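Results can be post-processed like any list of dataclass instances. The sketch below keeps only clones whose two methods live in different files and groups them by clone type. It relies on the documented `file:line` format of `method1_file` and `method2_file`; the reduced `CloneResult` stand-in, the helper names, and the second sample entry are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CloneResult:
    """Reduced stand-in with only the documented fields this sketch reads."""
    method1_name: str
    method1_file: str  # "file:line"
    method2_name: str
    method2_file: str  # "file:line"
    similarity: float
    clone_type: str


def file_of(location: str) -> str:
    """Strip the trailing ':line' from a 'file:line' location."""
    return location.rsplit(":", 1)[0]


def cross_file_by_type(clones: List[CloneResult]) -> Dict[str, List[CloneResult]]:
    """Keep clones spanning two different files, grouped by clone type."""
    grouped: Dict[str, List[CloneResult]] = {}
    for clone in clones:
        if file_of(clone.method1_file) != file_of(clone.method2_file):
            grouped.setdefault(clone.clone_type, []).append(clone)
    return grouped


sample = [
    CloneResult("validate_user", "auth.c:45", "verify_admin", "admin.c:120", 0.85, "renamed"),
    CloneResult("parse_a", "parse.c:10", "parse_b", "parse.c:80", 0.91, "renamed"),
]
grouped = cross_file_by_type(sample)
print({kind: [c.method1_name for c in found] for kind, found in grouped.items()})
```

Cross-file clones are often the better candidates for extracting a shared function, since same-file duplicates can sometimes be merged locally.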
Category-Specific Detection¶
# Find duplicated error handling code
clones = detector.detect_clones_for_category(
    category="error_handling",
    min_similarity=0.6,
)
In Workflow (Scenario S05 Refactoring)¶
Clone detection is automatically invoked in the refactoring scenario when the query mentions duplicates or clones:
from src.analysis.clone_detector import ASTCloneDetector, detect_duplicate_category
# Auto-detect category from user query
category, patterns = detect_duplicate_category(query)
detector = ASTCloneDetector(cpg)
clones = detector.detect_clones_for_category(category, min_similarity=0.6)
Configuration¶
# config.yaml
clone_detection:
  min_similarity_strict: 0.7    # General detection threshold
  min_similarity_relaxed: 0.6   # Category-specific threshold
  max_methods_standard: 200     # Methods analyzed (standard mode)
  max_methods_extended: 300     # Methods analyzed (extended mode)
  min_lines_for_clone: 5        # Minimum method size (lines)
  patterns_limit: 3             # Patterns checked per category
  clones_limit: 20              # Max clones returned
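The configuration keys line up with the documented defaults of `detect_clones` (0.7, 200, 5). As a rough sketch of how they could be turned into call arguments, the helper below merges user overrides onto those defaults. The function name `detect_clones_kwargs`, the `extended` flag, and the key-to-parameter mapping are assumptions for illustration; the module may resolve its configuration differently.

```python
from typing import Any, Dict

# Values copied from the clone_detection section of config.yaml.
CLONE_DETECTION_DEFAULTS: Dict[str, Any] = {
    "min_similarity_strict": 0.7,
    "min_similarity_relaxed": 0.6,
    "max_methods_standard": 200,
    "max_methods_extended": 300,
    "min_lines_for_clone": 5,
    "patterns_limit": 3,
    "clones_limit": 20,
}


def detect_clones_kwargs(config: Dict[str, Any], extended: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for detect_clones from a (partial) config dict."""
    merged = {**CLONE_DETECTION_DEFAULTS, **config}
    methods_key = "max_methods_extended" if extended else "max_methods_standard"
    return {
        "min_similarity": merged["min_similarity_strict"],
        "max_methods": merged[methods_key],
        "min_lines": merged["min_lines_for_clone"],
    }


print(detect_clones_kwargs({}))
print(detect_clones_kwargs({"min_lines_for_clone": 10}, extended=True))
```

The result could then be passed as `detector.detect_clones(**kwargs)`.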
Pattern Categories¶
Categories define which code patterns to look for when detecting clones:
| Category | Example Functions and Patterns |
|---|---|
| null_check | NULL/nullptr assertions, pointer checks |
| string_operations | strlen, strcmp, strncpy, snprintf |
| error_handling | Error codes, exceptions, errno |
| memory_allocation | malloc, free, realloc patterns |
| lock_management | Mutex lock/unlock, atomic operations |
| buffer_operations | Buffer read/write, memcpy |
Categories are loaded from the active domain plugin via domain.get_operation_categories(), making clone detection domain-agnostic.
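One plausible way to derive the `shared_patterns` field from such categories is to test each method body against per-category patterns and intersect the results. The sketch below assumes a hand-written regex map for three of the categories; in the real module the categories come from the domain plugin, and the names `CATEGORY_PATTERNS`, `categories_in`, and `shared_patterns` used here are hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical patterns for three of the documented categories.
CATEGORY_PATTERNS: Dict[str, List[str]] = {
    "null_check": [r"\bNULL\b", r"\bnullptr\b"],
    "memory_allocation": [r"\bmalloc\b", r"\bfree\b", r"\brealloc\b"],
    "string_operations": [r"\bstrlen\b", r"\bstrcmp\b", r"\bstrncpy\b", r"\bsnprintf\b"],
}


def categories_in(code: str) -> Set[str]:
    """Return every category with at least one pattern matching the code."""
    return {
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(re.search(pattern, code) for pattern in patterns)
    }


def shared_patterns(code_a: str, code_b: str) -> List[str]:
    """Categories present in both method bodies, sorted for stable output."""
    return sorted(categories_in(code_a) & categories_in(code_b))


body_a = "if (p == NULL) return -1; buf = malloc(n);"
body_b = "if (q == NULL) goto fail; len = strlen(s);"
print(shared_patterns(body_a, body_b))  # ['null_check']
```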
Algorithm¶
- Load methods from CPG (filtered by min_lines, optional category)
- Pairwise comparison (O(n^2), bounded by max_methods):
  - Extract tokens, normalized tokens, AST structure, control flow
  - Compute similarity metrics in order:
    - Token Jaccard > 0.95 → Type-1 (exact)
    - Normalized token Jaccard > 0.80 → Type-2 (renamed)
    - AST sequence similarity > 0.65 → Type-3 (structural)
    - Control flow similarity > 0.60 → Type-4 (semantic)
  - Return highest similarity with clone type
- Filter by min_similarity threshold
- Find shared patterns between methods
- Sort by similarity descending, cap at clones_limit
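The ordered threshold checks can be sketched as a small cascade. This version assumes the four similarity scores for a pair are already computed and that the first type whose threshold is exceeded wins; the metric keys and the function name `classify_pair` are hypothetical, and the actual tie-breaking in the module may differ.

```python
from typing import Dict, Optional, Tuple

# (clone type, metric key, threshold), in the documented check order.
CHECKS = [
    ("exact", "token", 0.95),
    ("renamed", "normalized_token", 0.80),
    ("structural", "ast", 0.65),
    ("semantic", "control_flow", 0.60),
]


def classify_pair(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (clone_type, similarity) for the first check that passes, else None."""
    for clone_type, metric, threshold in CHECKS:
        score = scores.get(metric, 0.0)
        if score > threshold:
            return clone_type, score
    return None


renamed_pair = {"token": 0.70, "normalized_token": 0.85, "ast": 0.90, "control_flow": 0.95}
unrelated_pair = {"token": 0.10, "normalized_token": 0.20, "ast": 0.30, "control_flow": 0.40}
print(classify_pair(renamed_pair))    # ('renamed', 0.85)
print(classify_pair(unrelated_pair))  # None
```

Checking the strictest type first means a pair that would also pass the structural or semantic checks is still reported under the most specific label.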
Similarity Metrics¶
Jaccard(A, B) = |A ∩ B| / |A ∪ B| # For token sets
Sequence(A, B) = |Counter(A) ∩ Counter(B)| / |Counter(A) ∪ Counter(B)| # For sequences (compared as multisets, so element order is ignored)
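Both formulas translate directly to Python. The sketch below reads `|Counter(A) ∩ Counter(B)|` as the sum of per-element minimum counts and the union as the sum of maximum counts, which is how `collections.Counter` defines `&` and `|`. Returning 0.0 for two empty inputs is an assumption, as are the function names.

```python
from collections import Counter
from typing import Hashable, Iterable


def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty inputs score 0
    return len(set_a & set_b) / len(union)


def sequence_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Multiset Jaccard: counts matter, element order does not."""
    count_a, count_b = Counter(a), Counter(b)
    union_size = sum((count_a | count_b).values())         # max count per element
    if union_size == 0:
        return 0.0  # assumption: two empty inputs score 0
    overlap_size = sum((count_a & count_b).values())       # min count per element
    return overlap_size / union_size


seq_a = ["if", "call", "call", "return"]
seq_b = ["if", "call", "return", "return"]
print(jaccard(seq_a, seq_b))              # 1.0: same set of elements
print(sequence_similarity(seq_a, seq_b))  # 0.6: counts differ (3 shared / 5 total)
```

The example shows why the count-aware variant is used for structure and control flow sequences: two methods can use the same node kinds yet in different proportions.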
Example Output¶
[renamed] 85% similarity
validate_user (auth.c:45, 20 lines)
verify_admin (admin.c:120, 22 lines)
Shared patterns: null_check, error_handling
[structural] 72% similarity
process_request (handler.c:89, 35 lines)
handle_event (event.c:201, 38 lines)
Shared patterns: error_handling, buffer_operations
Limitations¶
- O(n^2) complexity: bounded by max_methods (default 200) to keep analysis practical
- Method-level only: compares whole methods, not arbitrary code blocks
- Pattern-based parsing: uses regex-based tokenization, not full language-aware AST
- No cross-language: clones detected within same language only
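To make the quadratic cost concrete, the number of unordered pairs among n methods is n(n-1)/2. The short calculation below shows the pair counts for the standard (200) and extended (300) limits from the configuration, plus 400 as an illustrative value that is not a documented setting.

```python
from math import comb


def pair_count(n_methods: int) -> int:
    """Number of unordered method pairs in an all-pairs comparison."""
    return comb(n_methods, 2)


for n in (200, 300, 400):
    print(f"{n} methods -> {pair_count(n)} pairs")
# 200 methods -> 19900 pairs
# 300 methods -> 44850 pairs
# 400 methods -> 79800 pairs
```

Doubling the method limit roughly quadruples the number of comparisons, which is why max_methods is capped.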
Related Documentation¶
- Refactoring Guide — using clone detection in refactoring workflows
- Analysis Modules — other analysis modules
- Security Reference — finding copy-pasted vulnerable code
- Scenarios — S05 Refactoring scenario
Module: src/analysis/clone_detector.py
Last updated: February 2026