Scenario 10: Cross-Repository Analysis

In this scenario, an architect analyzes dependencies, duplications, and consolidation opportunities across multiple repositories using CPG-based agents and graph methods.

Quick Start

# Select Cross-Repository Scenario
/select 10

How It Works

Two-Phase Architecture

S10 uses the standard two-phase architecture with handler-based Phase 1 and LLM fallback Phase 2:

Query -> CrossRepoIntentDetector.detect()
  |
  Phase 1: integrate_handlers(state)
    -> HandlerRegistry("cross_repo") -> match handler by intent
    -> handler.handle() -> CrossRepoReportFormatter -> answer
    |
    handled=True  -> return formatted result (no LLM)
    handled=False -> Phase 2
  |
  Phase 2: cross_repo_workflow() [LLM fallback]
    -> 3 Agents (RepositoryIndexer, CrossRepoAnalyzer, DependencyMapper)
    -> CallGraphAnalyzer (graph insights)
    -> PromptRegistry -> LLMInterface.generate() -> answer

cross_repo_workflow() in cross_repo.py first calls integrate_handlers(state). If a handler matches (handled=True), its formatted result is returned without calling the LLM. Otherwise, the full LLM-based workflow runs with 3 agents and graph analysis.
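
The control flow described above can be sketched as follows. This is a minimal illustration, not the project's code: HandlerResult, run_two_phase, demo_handlers, and demo_llm are hypothetical names, and the stand-in callables only mimic the handled=True / handled=False branching.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HandlerResult:
    # Hypothetical result type; `handled` mirrors the flag described above.
    handled: bool
    answer: Optional[str] = None


def run_two_phase(
    query: str,
    phase1: Callable[[str], HandlerResult],
    phase2: Callable[[str], str],
) -> str:
    """Try the handler phase first; fall back to the LLM phase if unhandled."""
    result = phase1(query)
    if result.handled and result.answer is not None:
        return result.answer  # Phase 1 hit: no LLM call is made
    return phase2(query)      # Phase 2: LLM fallback


# Stand-in callables for illustration only.
def demo_handlers(query: str) -> HandlerResult:
    if "duplicate" in query.lower():
        return HandlerResult(handled=True, answer="[handler] duplicate report")
    return HandlerResult(handled=False)


def demo_llm(query: str) -> str:
    return f"[llm] {query}"
```

Passing the two phases as callables keeps the sketch independent of the real HandlerRegistry and LLMInterface classes.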

Intent Detection

CrossRepoIntentDetector(IntentDetector) in cross_repo_handlers/intent_detector.py defines 6 intents sorted by priority:

Intent | Priority | Keywords (EN + RU)
hook_usage | 5 | hook, extension hook, callback, хук, точка расширения
spi_dependency | 8 | spi, server programming interface, серверный интерфейс
code_consolidation | 10 | consolidate, merge, combine, unified, консолидация, объединить
duplicate_detection | 20 | duplicate, clone, copy, redundant, дубликат, клон
cross_repo_search | 30 | across repo, cross repo, all repo, кросс-репозиторий
consistency_check | 40 | consistency, standard, naming convention, согласованность

Fallback: general_cross_repo (confidence=0.5) when no pattern matches.

Keyword matching uses keyword_match_morphological() for Russian lemma support. Domain-specific hook keywords are loaded dynamically via _get_domain_hook_keywords() from the active domain plugin.
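
A rough sketch of priority-ordered keyword matching, using the intents and keywords from the table above. It makes two assumptions that the source does not confirm: that a lower priority number is checked first, and that plain substring matching is an acceptable stand-in for keyword_match_morphological(), which this sketch does not reproduce. The names INTENTS and detect_intent are hypothetical.

```python
from typing import List, Tuple

# (intent, priority, keywords) copied from the intent table.
INTENTS: List[Tuple[str, int, List[str]]] = [
    ("hook_usage", 5, ["hook", "extension hook", "callback", "хук", "точка расширения"]),
    ("spi_dependency", 8, ["spi", "server programming interface", "серверный интерфейс"]),
    ("code_consolidation", 10, ["consolidate", "merge", "combine", "unified",
                                "консолидация", "объединить"]),
    ("duplicate_detection", 20, ["duplicate", "clone", "copy", "redundant",
                                 "дубликат", "клон"]),
    ("cross_repo_search", 30, ["across repo", "cross repo", "all repo",
                               "кросс-репозиторий"]),
    ("consistency_check", 40, ["consistency", "standard", "naming convention",
                               "согласованность"]),
]

FALLBACK_INTENT = "general_cross_repo"  # reported with confidence=0.5 in the source


def detect_intent(query: str) -> str:
    """Return the first intent (assumed: lowest priority number) with a keyword hit."""
    text = query.lower()
    for intent, _priority, keywords in sorted(INTENTS, key=lambda row: row[1]):
        # Substring match only; the real detector uses morphological matching.
        if any(keyword in text for keyword in keywords):
            return intent
    return FALLBACK_INTENT
```

Because this sketch matches raw substrings, inflected forms such as "consolidation" do not hit the keyword "consolidate"; handling such forms is presumably the purpose of the morphological matcher.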

Handler Phase

3 Handlers

cross_repo_handlers/workflow.py registers 3 handlers in HandlerRegistry("cross_repo"):

Handler | Priority | Intent | Description
ExtensionDependenciesHandler | 5 | hook_usage, spi_dependency | Extension API and SPI dependency analysis
ConsolidationHandler | 10 | code_consolidation | Code consolidation candidate detection
DuplicateDetectionHandler | 20 | duplicate_detection | Duplicate code detection across files

All handlers inherit from CrossRepoHandler(BaseHandler).
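
To make the intent-to-handler mapping concrete, here is a small sketch of a priority-ordered registry populated from the handler table. HandlerSpec and SimpleRegistry are hypothetical stand-ins; the actual HandlerRegistry API is not shown in this document, and the assumption that lower priority numbers are tried first is mine.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class HandlerSpec:
    # Hypothetical descriptor; the real handlers are classes with handle().
    name: str
    priority: int
    intents: FrozenSet[str]


class SimpleRegistry:
    """Keeps handlers sorted by priority and returns the first intent match."""

    def __init__(self) -> None:
        self._handlers: List[HandlerSpec] = []

    def register(self, spec: HandlerSpec) -> None:
        self._handlers.append(spec)
        self._handlers.sort(key=lambda h: h.priority)

    def match(self, intent: str) -> Optional[HandlerSpec]:
        for handler in self._handlers:
            if intent in handler.intents:
                return handler
        return None  # no handler: the caller would move on to Phase 2


registry = SimpleRegistry()
registry.register(HandlerSpec("DuplicateDetectionHandler", 20,
                              frozenset({"duplicate_detection"})))
registry.register(HandlerSpec("ExtensionDependenciesHandler", 5,
                              frozenset({"hook_usage", "spi_dependency"})))
registry.register(HandlerSpec("ConsolidationHandler", 10,
                              frozenset({"code_consolidation"})))
```

In this sketch, intents that appear in no handler row (for example consistency_check) return None from match(), which would correspond to handled=False.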

CrossRepoHandler Base Methods

CrossRepoHandler in cross_repo_handlers/handlers/base.py provides 4 shared CPG query methods:

Method | Description
_find_duplicate_code(min_similarity, limit) | Find duplicate code patterns via signature matching
_search_across_files(pattern, scope, limit) | Search for methods matching a pattern across files
_check_naming_consistency(pattern_type) | Check naming convention consistency (snake_case, camelCase, PascalCase)
_analyze_consolidation_candidates(threshold) | Find code consolidation candidates grouped by signature

ExtensionDependenciesHandler additionally uses get_extension_dependency_patterns_from_plugin() and build_sql_like_clause() from _plugin_helpers to load domain-specific extension API patterns (domain-agnostic approach).
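
As an illustration of the kind of check _check_naming_consistency(pattern_type) describes, the sketch below classifies identifiers as snake_case, camelCase, or PascalCase and reports the outliers. It works on a plain list of names rather than CPG query results, and the regular expressions, classify_name, and naming_report are assumptions made for this example only.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Assumed patterns; the project's actual rules may differ.
STYLE_PATTERNS = {
    "snake_case": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$"),
    "camelCase": re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$"),
    "PascalCase": re.compile(r"^([A-Z][a-z0-9]*)+$"),
}


def classify_name(name: str) -> str:
    """Return the first style whose pattern matches, or 'other'."""
    for style, pattern in STYLE_PATTERNS.items():
        if pattern.match(name):
            return style
    return "other"


def naming_report(names: Iterable[str]) -> Dict[str, object]:
    """Count styles and list names that deviate from the most common one."""
    styles = {name: classify_name(name) for name in names}
    counts = Counter(styles.values())
    dominant: Optional[str] = counts.most_common(1)[0][0] if counts else None
    outliers: List[str] = [n for n, s in styles.items() if s != dominant]
    return {"counts": dict(counts), "dominant": dominant, "outliers": outliers}
```

A single lowercase word such as palloc is counted as snake_case here, which is a simplification.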

Report Formatters

CrossRepoReportFormatter(CrossRepoFormatter) in cross_repo_handlers/formatters/cross_repo_report.py provides 2 report formatters:

Method | Used by
format_consolidation_report(report_data, language) | ConsolidationHandler
format_duplicate_report(report_data, language) | DuplicateDetectionHandler

CrossRepoFormatter(BaseFormatter) provides helper methods: format_file_list(), format_consolidation_badge(), format_similarity_badge(). All formatters support EN/RU localization.

LLM Phase

3 Agents

When no handler matches, cross_repo_workflow() executes the full LLM pipeline with 3 agents from src/cross_repo/cross_repo_agents.py:

Agent | Role | Key Methods
RepositoryIndexer | Discover repos, extract metadata, index into CPG | discover_repositories(path), index_repository_cpg(repo)
CrossRepoAnalyzer | Detect code duplications and consolidation opportunities | find_code_duplications(repos, min_similarity, min_lines), find_similar_utilities(repos), identify_consolidation_opportunities(dups)
DependencyMapper | Map dependencies, detect circular deps, generate report | map_dependencies(repos), generate_dependency_graph(deps), detect_circular_dependencies(graph), generate_dependency_report(...)

The pipeline:

  1. RepositoryIndexer discovers and indexes repositories from workspace_path
  2. CrossRepoAnalyzer finds duplications and consolidation opportunities
  3. DependencyMapper maps dependencies, detects circular dependencies, and generates the report
  4. The evidence list is built from duplications, opportunities, and high-risk dependencies
  5. PromptRegistry.get_agent_prompt("cross_repo_analyzer", ...) builds the prompt
  6. LLMInterface().generate() produces the final answer
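
The circular-dependency step of the pipeline can be illustrated with a depth-first search over a repository dependency graph. This is a generic sketch, not DependencyMapper's implementation: the adjacency-dict input format and the find_cycles name are assumptions, and the function reports one cycle per back edge rather than every elementary cycle.

```python
from typing import Dict, List


def find_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return cycles found via DFS back edges, each as [a, b, ..., a]."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []
    cycles: List[List[str]] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        path.append(node)
        for target in graph.get(node, []):
            state = color.get(target, WHITE)
            if state == GRAY:
                # Back edge: target is on the current path, so we closed a loop.
                cycles.append(path[path.index(target):] + [target])
            elif state == WHITE:
                visit(target)
        path.pop()
        color[node] = BLACK

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            visit(node)
    return cycles


# Hypothetical repositories: core -> utils -> ext -> core forms a cycle.
example_graph = {
    "core": ["utils"],
    "utils": ["ext"],
    "ext": ["core"],
    "tools": ["core"],
}
```

The recursive form is adequate for a handful of repositories; a very deep graph would call for an explicit stack.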

CallGraphAnalyzer Integration

After the 3 agents complete, CallGraphAnalyzer(cpg) from src/analysis performs graph-based analysis:

  • Shared methods: For each duplication, analyzes callers/callees/impact via find_all_callers(), find_all_callees(), analyze_impact(). Calculates consolidation_score = (callers + callees) * instances.
  • Cross-repo calls: For dependencies with source_method, finds callees to detect tight coupling. Flags high-coupling dependencies for decoupling priority.
  • Consolidation patterns: Groups methods by call graph signature (set of callees). Methods with identical call patterns are consolidation candidates.
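
The two calculations above can be sketched directly. The score formula is taken from the text; the grouping function is an assumed reading of "call graph signature (set of callees)" and uses hypothetical method names. Skipping methods with no callees is my choice, not something the source specifies.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List


def consolidation_score(callers: int, callees: int, instances: int) -> int:
    """consolidation_score = (callers + callees) * instances, as stated above."""
    return (callers + callees) * instances


def group_by_call_signature(
    callees_by_method: Dict[str, Iterable[str]],
) -> List[List[str]]:
    """Group methods whose callee sets are identical (order-insensitive)."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for method, callees in callees_by_method.items():
        groups[frozenset(callees)].append(method)
    # Keep only non-empty signatures shared by at least two methods.
    return [sorted(methods) for signature, methods in groups.items()
            if signature and len(methods) > 1]


# Hypothetical input: two formatting helpers call the same functions.
example_callees = {
    "repo_a.fmt": ["strlen", "malloc"],
    "repo_b.format": ["malloc", "strlen"],
    "repo_a.log": ["write"],
}
```

For example, a duplication with 4 callers, 2 callees, and 3 instances scores (4 + 2) * 3 = 18.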

Graph Insights

Graph analysis produces 3 insight categories stored in state["metadata"]["graph_insights"]:

Category | Description
shared_methods | Methods with consolidation_benefit (high/medium/low) and consolidation_score
cross_repo_calls | High-coupling dependencies with decoupling_priority
consolidation_patterns | Method groups with the same call signature: “Extract to shared library” candidates

Data Models

Key dataclasses from src/cross_repo/repo_patterns.py:

Model | Key Fields
RepositoryInfo | repo_id, name, language, method_count, file_count
CodeDuplication | pattern_name, similarity_score, severity, instances (List[CodeInstance]), potential_savings
CrossRepoDependency | source_repo, target_repo, dependency_type (DependencyType), coupling_score, risk_level (RiskLevel)
ConsolidationOpportunity | title, priority, estimated_savings, estimated_effort
ConsolidationReport | total_repos, total_methods, risk_summary, estimated_total_savings

Enums: DependencyType (import, function_call, type_reference, …), RiskLevel (critical, high, medium, low).
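
For orientation, one of these models might look roughly like the sketch below. Field and enum member names come from the tables above; the field types, the enum string values, and the high_risk helper are assumptions, and the DependencyType members elided in the source are left out rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DependencyType(Enum):
    IMPORT = "import"
    FUNCTION_CALL = "function_call"
    TYPE_REFERENCE = "type_reference"
    # Further members are elided in the source and not reproduced here.


class RiskLevel(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class CrossRepoDependency:
    source_repo: str
    target_repo: str
    dependency_type: DependencyType
    coupling_score: float  # assumed numeric; the source gives no type
    risk_level: RiskLevel


def high_risk(deps: List[CrossRepoDependency]) -> List[CrossRepoDependency]:
    """Hypothetical helper: keep dependencies rated critical or high."""
    wanted = {RiskLevel.CRITICAL, RiskLevel.HIGH}
    return [dep for dep in deps if dep.risk_level in wanted]


example_deps = [
    CrossRepoDependency("extension1", "postgres", DependencyType.FUNCTION_CALL,
                        0.9, RiskLevel.HIGH),
    CrossRepoDependency("extension1", "postgres", DependencyType.TYPE_REFERENCE,
                        0.2, RiskLevel.LOW),
]
```

A filter like high_risk is one plausible way to select the high-risk dependencies that feed the evidence list in the LLM phase.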

Configuration

S10 is domain-agnostic. All domain-specific data is loaded from plugins:

  • _get_api_keywords() — API keyword-to-function mapping from domain.get_api_keywords()
  • _get_domain_hook_keywords() — hook-specific keywords from domain.get_intent_keywords("CROSS_REPO")
  • get_extension_dependency_patterns_from_plugin() — extension dependency patterns

Project configuration lives in config.yaml under the projects key:

projects:
  registry:
    postgres:
      db_path: data/projects/postgres.duckdb
      language: c
      domain: postgres
    extension1:
      db_path: data/projects/extension1.duckdb
      language: c
      domain: postgres

CLI Usage

# Cross-repo dependency analysis
python -m src.cli query "Find cross-repository dependencies"

# Duplicate detection
python -m src.cli query "Find duplicate code across repositories"

# Consolidation analysis
python -m src.cli query "Show consolidation opportunities"

# Extension dependency analysis
python -m src.cli query "Which extensions depend on memory allocation functions?"

# Hook usage analysis
python -m src.cli query "Find all hook usage patterns"

# Consistency check
python -m src.cli query "Check naming consistency across codebase"

Example Questions

  • “Find cross-repository dependencies”
  • “Find duplicate code across repositories”
  • “Show consolidation opportunities”
  • “Which extensions use [function_name]?”
  • “Find all hook usage patterns”
  • “Check naming consistency across the codebase”
  • “What would break if [function] changes?”
  • “Show high-risk dependencies”
  • “Find functions used by extensions”

S10 vs S11: S10 (Cross-Repository) focuses on multi-repository analysis — duplications across repos, inter-repo dependencies, consolidation opportunities, and cross-repo call graph analysis. S11 (Architecture) focuses on internal architecture of a single project — layer analysis, coupling/cohesion metrics, circular dependencies within one codebase.