Clone Detection

AST-based clone detection engine that identifies duplicate and near-duplicate code fragments. Detects 4 clone types from exact duplicates to semantic clones using token similarity, AST structure comparison, and control flow analysis.

Overview

Clone detection finds copy-pasted or structurally similar code across the codebase. This helps with:
  • Security: find copy-pasted vulnerable code (e.g., unsafe strcpy patterns)
  • Quality: reduce code duplication for easier maintenance
  • Refactoring: identify candidates for extracting shared functions

Clone Types

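| Type   | Name       | Detection Method                                                              | Default Threshold |
|--------|------------|-------------------------------------------------------------------------------|-------------------|
| Type-1 | Exact      | Token Jaccard similarity                                                      | 0.95              |
| Type-2 | Renamed    | Normalized token similarity (variables/types replaced with placeholders)      | 0.80              |
| Type-3 | Structural | AST structure sequence similarity                                             | 0.65              |
| Type-4 | Semantic   | Control flow sequence similarity                                              | 0.60              |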

The checks run in order from Type-1 to Type-4. Each pair is reported once, with its highest similarity score and the corresponding clone type.
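To illustrate the difference between Type-1 and Type-2, the sketch below tokenizes two C statements with a regex, then replaces identifiers with a placeholder before comparing token sets. The tokenizer, the keyword list, and the `ID` placeholder are assumptions made for illustration; they are not taken from the module.

```python
import re
from typing import List

# Hypothetical keyword list: names that are kept as-is during normalization.
KEYWORDS = {"if", "else", "return", "for", "while", "int", "char", "NULL"}
# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code: str) -> List[str]:
    return TOKEN_RE.findall(code)

def normalize(tokens: List[str]) -> List[str]:
    # Replace every non-keyword identifier with the placeholder "ID".
    return ["ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
            for t in tokens]

def overlap(a: List[str], b: List[str]) -> float:
    # Set overlap (intersection over union) of two token lists.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

a = tokenize("if (user == NULL) return -1;")
b = tokenize("if (admin == NULL) return -1;")
raw = overlap(a, b)                          # differs only in one identifier
norm = overlap(normalize(a), normalize(b))   # identical after renaming
```

Here the raw score stays below the Type-1 threshold of 0.95 because the two identifiers differ, while the normalized score reaches 1.0, so the pair would qualify as a renamed clone under the thresholds in the table.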

API Reference

CloneResult

from dataclasses import dataclass
from typing import List

@dataclass
class CloneResult:
    method1_id: int              # First method CPG node ID
    method1_name: str            # First method name
    method1_file: str            # First method file:line
    method2_id: int              # Second method CPG node ID
    method2_name: str            # Second method name
    method2_file: str            # Second method file:line
    similarity: float            # 0.0-1.0 similarity score
    clone_type: str              # 'exact', 'renamed', 'structural', 'semantic'
    shared_patterns: List[str]   # e.g., ['error_handling', 'null_check']
    line_count1: int             # Lines in first method
    line_count2: int             # Lines in second method
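Because results are plain dataclass instances, they are easy to post-process. The sketch below groups results by `clone_type`. It uses a local stand-in class with the documented fields so that it runs on its own; `group_by_type` is a hypothetical helper, and the node IDs are made up (names, files, and scores are borrowed from the Example Output section).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CloneResult:  # local stand-in mirroring the documented fields
    method1_id: int
    method1_name: str
    method1_file: str
    method2_id: int
    method2_name: str
    method2_file: str
    similarity: float
    clone_type: str
    shared_patterns: List[str]
    line_count1: int
    line_count2: int

def group_by_type(results: List[CloneResult]) -> Dict[str, List[CloneResult]]:
    """Bucket clone results by their clone_type string."""
    grouped: Dict[str, List[CloneResult]] = defaultdict(list)
    for result in results:
        grouped[result.clone_type].append(result)
    return dict(grouped)

results = [
    CloneResult(1, "validate_user", "auth.c:45", 2, "verify_admin", "admin.c:120",
                0.85, "renamed", ["null_check", "error_handling"], 20, 22),
    CloneResult(3, "process_request", "handler.c:89", 4, "handle_event", "event.c:201",
                0.72, "structural", ["error_handling", "buffer_operations"], 35, 38),
]
grouped = group_by_type(results)
```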

ASTCloneDetector

from typing import List, Optional

class ASTCloneDetector:
    def __init__(self, cpg_service):
        """Initialize with CPGQueryService for database access."""

    def detect_clones(
        self,
        min_similarity: Optional[float] = None,  # Default: 0.7
        category: Optional[str] = None,           # Filter by pattern category
        max_methods: Optional[int] = None,        # Default: 200
        min_lines: Optional[int] = None,          # Default: 5
    ) -> List[CloneResult]:
        """
        Detect clones across the entire codebase.

        Returns:
            List of CloneResult sorted by similarity descending
        """

    def detect_clones_for_category(
        self,
        category: str,
        min_similarity: Optional[float] = None,  # Default: 0.6
    ) -> List[CloneResult]:
        """
        Detect clones within a specific code pattern category.

        Args:
            category: Pattern category (e.g., 'error_handling', 'null_check')
        """

Usage

Basic Detection

from src.analysis.clone_detector import ASTCloneDetector

detector = ASTCloneDetector(cpg_service)
clones = detector.detect_clones(min_similarity=0.7)

for clone in clones:
    print(f"[{clone.clone_type}] {clone.similarity:.0%}")
    print(f"  {clone.method1_name} ({clone.method1_file})")
    print(f"  {clone.method2_name} ({clone.method2_file})")
    print(f"  Shared patterns: {', '.join(clone.shared_patterns)}")

Category-Specific Detection

# Find duplicated error handling code
clones = detector.detect_clones_for_category(
    category="error_handling",
    min_similarity=0.6
)

In Workflow (Scenario S05 Refactoring)

Clone detection is automatically invoked in the refactoring scenario when the query mentions duplicates or clones:

from src.analysis.clone_detector import ASTCloneDetector, detect_duplicate_category

# Auto-detect category from user query
category, patterns = detect_duplicate_category(query)
detector = ASTCloneDetector(cpg)
clones = detector.detect_clones_for_category(category, min_similarity=0.6)

Configuration

# config.yaml
clone_detection:
  min_similarity_strict: 0.7        # General detection threshold
  min_similarity_relaxed: 0.6       # Category-specific threshold
  max_methods_standard: 200          # Methods analyzed (standard mode)
  max_methods_extended: 300          # Methods analyzed (extended mode)
  min_lines_for_clone: 5            # Minimum method size (lines)
  patterns_limit: 3                  # Patterns checked per category
  clones_limit: 20                   # Max clones returned
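The `Optional[...] = None` parameters in the API suggest that omitted arguments fall back to these configuration values. A minimal sketch of that fallback follows, assuming the configuration has already been loaded into a dict; `resolve` is a hypothetical helper, and the mapping of parameters to keys is inferred from the comments above rather than confirmed by the source.

```python
from typing import Any, Dict

# Values copied from the clone_detection section of config.yaml.
CLONE_CONFIG: Dict[str, Any] = {
    "min_similarity_strict": 0.7,
    "min_similarity_relaxed": 0.6,
    "max_methods_standard": 200,
    "max_methods_extended": 300,
    "min_lines_for_clone": 5,
    "patterns_limit": 3,
    "clones_limit": 20,
}

def resolve(value: Any, key: str, config: Dict[str, Any] = CLONE_CONFIG) -> Any:
    """Return the caller's value, or the configured default when it is None."""
    # Compare against None explicitly so that 0 and 0.0 remain valid overrides.
    return config[key] if value is None else value

min_similarity = resolve(None, "min_similarity_strict")  # falls back to config
max_methods = resolve(50, "max_methods_standard")        # explicit override wins
```

Testing `value is None` rather than truthiness keeps a deliberate `0` or `0.0` from being silently replaced by the default.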

Pattern Categories

Categories define which code patterns to look for when detecting clones:

| Category          | Example Functions                       |
|-------------------|-----------------------------------------|
| null_check        | NULL/nullptr assertions, pointer checks |
| string_operations | strlen, strcmp, strncpy, snprintf       |
| error_handling    | Error codes, exceptions, errno          |
| memory_allocation | malloc, free, realloc patterns          |
| lock_management   | Mutex lock/unlock, atomic operations    |
| buffer_operations | Buffer read/write, memcpy               |

Categories are loaded from the active domain plugin via domain.get_operation_categories(), making clone detection domain-agnostic.
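As a rough sketch of how categories from a plugin could drive the shared-pattern lookup, the example below assumes that `get_operation_categories()` returns a mapping from category name to function names. That return shape, the `ExampleCDomain` class, and both helper functions are hypothetical; only the method name comes from this document.

```python
import re
from typing import Dict, List

class ExampleCDomain:
    """Hypothetical domain plugin; the return shape is an assumption."""

    def get_operation_categories(self) -> Dict[str, List[str]]:
        return {
            "string_operations": ["strlen", "strcmp", "strncpy", "snprintf"],
            "memory_allocation": ["malloc", "free", "realloc"],
        }

def categories_in(code: str, domain) -> List[str]:
    """Names of categories with at least one function mentioned in code."""
    categories = domain.get_operation_categories()
    return sorted(
        name for name, functions in categories.items()
        if any(re.search(rf"\b{re.escape(fn)}\b", code) for fn in functions)
    )

def shared_patterns(code1: str, code2: str, domain) -> List[str]:
    """Categories that appear in both code fragments."""
    return sorted(set(categories_in(code1, domain)) & set(categories_in(code2, domain)))

domain = ExampleCDomain()
first = "char *p = malloc(strlen(s) + 1);"
second = "free(buf);"
common = shared_patterns(first, second, domain)
```

Swapping in a different plugin object changes the categories without touching the matching code, which is the sense in which the detection stays domain-agnostic.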

Algorithm

  1. Load methods from CPG (filtered by min_lines, optional category)
  2. Pairwise comparison (O(n^2), bounded by max_methods):
    • Extract tokens, normalized tokens, AST structure, control flow
    • Compute similarity metrics in order:
      • Token Jaccard > 0.95 → Type-1 (exact)
      • Normalized token Jaccard > 0.80 → Type-2 (renamed)
      • AST sequence similarity > 0.65 → Type-3 (structural)
      • Control flow similarity > 0.60 → Type-4 (semantic)
    • Return highest similarity with clone type
  3. Filter by min_similarity threshold
  4. Find shared patterns between methods
  5. Sort by similarity descending, cap at clones_limit
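The steps above can be sketched as a small pipeline. This is one possible reading, not the module's code: it assumes the four metric scores for a pair are supplied by a caller-provided `score_pair` function (hypothetical), and it interprets the ordered checks as "the first threshold exceeded determines the clone type". Loading methods from the CPG and the shared-pattern lookup are left out.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (clone type, metric name, threshold), checked in this order.
THRESHOLDS = [
    ("exact", "token", 0.95),
    ("renamed", "normalized", 0.80),
    ("structural", "ast", 0.65),
    ("semantic", "control_flow", 0.60),
]

def classify(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return (clone_type, similarity) for the first threshold exceeded."""
    for clone_type, metric, threshold in THRESHOLDS:
        if scores[metric] > threshold:
            return clone_type, scores[metric]
    return None

def detect(
    names: Sequence[str],
    score_pair: Callable[[str, str], Dict[str, float]],
    min_similarity: float = 0.7,
    max_methods: int = 200,
    clones_limit: int = 20,
) -> List[Tuple[str, str, str, float]]:
    found = []
    # Cap the method count so the O(n^2) pair loop stays bounded.
    for m1, m2 in combinations(list(names)[:max_methods], 2):
        match = classify(score_pair(m1, m2))
        if match is not None and match[1] >= min_similarity:
            found.append((m1, m2, match[0], match[1]))
    # Highest similarity first, then cap the result list.
    found.sort(key=lambda row: row[3], reverse=True)
    return found[:clones_limit]
```

Note that with the default `min_similarity` of 0.7, a pair that only passes the Type-3 or Type-4 threshold with a score below 0.7 is classified but then filtered out.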

Similarity Metrics

Jaccard(A, B) = |A ∩ B| / |A ∪ B|          # For token sets
Sequence(A, B) = |Counter(A) ∩ Counter(B)| / |Counter(A) ∪ Counter(B)|  # For ordered sequences
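Both formulas map directly onto Python's `set` and `collections.Counter` operators. The sketch below is an illustration under two assumptions: the size of a counter is taken as the sum of its counts (multiset cardinality), and two empty inputs are treated as identical.

```python
from collections import Counter
from typing import Hashable, Iterable

def jaccard(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def sequence_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Multiset similarity: Counter & takes min counts, Counter | takes max counts."""
    ca, cb = Counter(a), Counter(b)
    union_size = sum((ca | cb).values())
    return sum((ca & cb).values()) / union_size if union_size else 1.0

tokens = jaccard(["a", "b", "c"], ["b", "c", "d"])
structure = sequence_similarity(
    ["if", "call", "call", "return"],
    ["if", "call", "return", "return"],
)
```

Unlike the set version, the counter version is sensitive to how often each element occurs. As written it still ignores position, so two sequences with the same elements in a different order score 1.0.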

Example Output

[renamed] 85% similarity
  validate_user (auth.c:45, 20 lines)
  verify_admin (admin.c:120, 22 lines)
  Shared patterns: null_check, error_handling

[structural] 72% similarity
  process_request (handler.c:89, 35 lines)
  handle_event (event.c:201, 38 lines)
  Shared patterns: error_handling, buffer_operations

Limitations

  1. O(n^2) complexity — bounded by max_methods (default 200) to keep analysis practical
  2. Method-level only — compares whole methods, not arbitrary code blocks
  3. Pattern-based parsing — uses regex-based tokenization, not full language-aware AST
  4. No cross-language — clones detected within same language only
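To put the first limitation in numbers, the number of unordered pairs among n methods is n(n-1)/2. The short sketch below computes it for the two `max_methods` settings from the configuration; `pair_count` is a hypothetical helper.

```python
import math

def pair_count(n_methods: int) -> int:
    """Number of unordered method pairs compared for n_methods methods."""
    return math.comb(n_methods, 2)

standard = pair_count(200)  # max_methods_standard
extended = pair_count(300)  # max_methods_extended
```

That is 19,900 comparisons in standard mode and 44,850 in extended mode, so raising the cap by 50% more than doubles the work.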

Module: src/analysis/clone_detector.py
Last updated: February 2026