External Context Integration Guide

Integration guide for linking CPG code entities with external systems (Git, Issue Trackers, APM).

Overview

External Context Integration allows you to link code entities (methods, functions, classes) in your Code Property Graph (CPG) with metadata from external systems:

  • Git: Author information, commit history, code churn
  • Issue Trackers: Jira, GitHub Issues, GitLab Issues
  • APM/Error Tracking: Sentry error data, frequency, severity

This enables powerful queries like:

  • “Who wrote this code?”
  • “What issues are linked to this function?”
  • “Which methods have the most production errors?”
  • “What code changes most frequently?”

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    External Systems                          │
├─────────────┬─────────────────────┬─────────────────────────┤
│    Git      │   Issue Trackers    │        Sentry           │
│  (commits)  │  (Jira/GitHub/GL)   │       (errors)          │
└──────┬──────┴──────────┬──────────┴────────────┬────────────┘
       │                 │                       │
       ▼                 ▼                       ▼
┌─────────────────────────────────────────────────────────────┐
│              ExternalContextOrchestrator                     │
│  ┌─────────────┐ ┌───────────────┐ ┌───────────────────┐   │
│  │GitSyncService│ │IssueSyncService│ │SentrySyncService │   │
│  └─────────────┘ └───────────────┘ └───────────────────┘   │
└─────────────────────────────────────────────────────────────┘
       │                                         │
       ▼                                         ▼
┌──────────────────┐                  ┌──────────────────────┐
│   PostgreSQL     │                  │      ChromaDB        │
│ (raw metadata)   │                  │ (semantic search)    │
└────────┬─────────┘                  └──────────┬───────────┘
         │                                       │
         └───────────────────┬───────────────────┘
                             ▼
                    ┌─────────────────┐
                    │   DuckDB CPG    │
                    │  (nodes_tag)    │
                    └─────────────────┘

Quick Setup

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment Variables

# Git (no additional config needed - uses local git)

# GitHub Issues
export GITHUB_TOKEN="ghp_xxxxxxxxxxxx"

# GitLab Issues
export GITLAB_TOKEN="glpat-xxxxxxxxxxxx"

# Jira
export JIRA_TOKEN="your_jira_api_token"
export JIRA_EMAIL="your@email.com"

# Sentry
export SENTRY_AUTH_TOKEN="your_sentry_token"
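Before running a sync, it can help to confirm which of these variables are actually set. The following is a minimal sketch, not part of the project: the variable names come from the step above, while the `REQUIRED_VARS` grouping and the `missing_vars` helper are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping

# Variable names are taken from the setup step above; grouping them by
# provider is an illustrative assumption, not the project's own logic.
REQUIRED_VARS = {
    "github": ["GITHUB_TOKEN"],
    "gitlab": ["GITLAB_TOKEN"],
    "jira": ["JIRA_TOKEN", "JIRA_EMAIL"],
    "sentry": ["SENTRY_AUTH_TOKEN"],
}


def missing_vars(env: Mapping[str, str] = os.environ) -> dict:
    """Return, per provider, the variables that are unset or empty."""
    report = {}
    for provider, names in REQUIRED_VARS.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            report[provider] = missing
    return report


if __name__ == "__main__":
    for provider, names in missing_vars().items():
        print(f"{provider}: missing {', '.join(names)}")
```

A provider that appears in the report only matters if you intend to sync it; Git needs no variables at all.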

3. Run Initial Sync

import asyncio
import os

from src.services.external_context import ExternalContextOrchestrator


async def main():
    # Initialize orchestrator
    orchestrator = ExternalContextOrchestrator(
        duckdb_conn=your_duckdb_connection,
        pg_conn=your_postgres_connection,  # optional
        repo_path="/path/to/your/repo"
    )

    # Sync all sources
    result = await orchestrator.sync_all(
        git_config={"since_days": 90, "include_blame": True},
        issue_config={"provider": "github", "repo": "owner/repo", "token": os.getenv("GITHUB_TOKEN")},
        sentry_config={"org_slug": "my-org", "project_slug": "my-project", "token": os.getenv("SENTRY_AUTH_TOKEN")}
    )

    print(result)


# sync_all is awaited, so it must run inside an event loop
asyncio.run(main())

Git Integration

GitSyncService

Syncs git history metadata to CPG tags.

from src.services.external_context import GitSyncService

service = GitSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,  # optional, for storing raw data
    repo_path="/path/to/repo"
)

# Sync last 30 days of commits
result = await service.sync(since_days=30, include_blame=True)
print(f"Synced {result.items_synced} commits, created {result.tags_created} tags")

Available Git Tags

Tag Name           Description                               Example Value
-----------------  ----------------------------------------  --------------------
git-commit         SHA of last commit modifying this method  a1b2c3d4...
git-author         Email of last modifier                    dev@company.com
git-branch         Branch where code originated              feature/PROJ-123
git-blame-count    Number of unique authors                  3
git-churn          Number of modifications                   15
git-last-modified  Timestamp of last change                  2025-01-09T10:30:00Z

Git Queries

-- Find all methods modified by a specific author
SELECT m.full_name, m.filename, t.value as author
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag t ON e.dst = t.id
WHERE t.name = 'git-author' AND t.value = 'developer@example.com';

-- Find high-churn code (methods changed > 10 times)
SELECT m.full_name, CAST(t.value AS INT) as churn_count
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag t ON e.dst = t.id
WHERE t.name = 'git-churn' AND CAST(t.value AS INT) > 10
ORDER BY churn_count DESC;

-- Find bus factor candidates (methods with only 1 author)
SELECT m.full_name, t_author.value as sole_author
FROM nodes_method m
JOIN edges_tagged_by e1 ON m.id = e1.src
JOIN nodes_tag t_blame ON e1.dst = t_blame.id
JOIN edges_tagged_by e2 ON m.id = e2.src
JOIN nodes_tag t_author ON e2.dst = t_author.id
WHERE t_blame.name = 'git-blame-count' AND t_blame.value = '1'
  AND t_author.name = 'git-author';
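To see how these queries traverse the tag model (method, then tagged-by edge, then tag), the sketch below runs the high-churn query against a tiny in-memory database. It makes several assumptions: SQLite stands in for DuckDB so that only the standard library is needed, the tables contain just the columns the queries above reference, and the sample rows are invented.

```python
import sqlite3

# Stand-in: SQLite instead of DuckDB, with only the columns the guide's
# queries reference. The real CPG schema has more tables and columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_method (id INTEGER PRIMARY KEY, full_name TEXT, filename TEXT);
CREATE TABLE nodes_tag (id INTEGER PRIMARY KEY, name TEXT, value TEXT);
CREATE TABLE edges_tagged_by (src INTEGER, dst INTEGER);
""")

# Hypothetical sample data: two methods, each tagged with a churn count.
conn.executemany("INSERT INTO nodes_method VALUES (?, ?, ?)",
                 [(1, "auth.login", "auth.py"), (2, "auth.logout", "auth.py")])
conn.executemany("INSERT INTO nodes_tag VALUES (?, ?, ?)",
                 [(10, "git-churn", "15"), (11, "git-churn", "4")])
conn.executemany("INSERT INTO edges_tagged_by VALUES (?, ?)", [(1, 10), (2, 11)])

# Same shape as the high-churn query above, with the threshold parameterized.
HIGH_CHURN = """
SELECT m.full_name, CAST(t.value AS INT) AS churn_count
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag t ON e.dst = t.id
WHERE t.name = 'git-churn' AND CAST(t.value AS INT) > ?
ORDER BY churn_count DESC
"""


def high_churn(min_churn: int) -> list:
    """Return (method, churn) pairs whose churn exceeds min_churn."""
    return conn.execute(HIGH_CHURN, (min_churn,)).fetchall()


print(high_churn(10))  # [('auth.login', 15)]
```

Tag values are stored as text, which is why the query casts `value` to an integer before comparing or sorting.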

Issue Tracker Integration

IssueSyncService

Links issues to code via commit message references.

import os

from src.services.external_context import IssueSyncService

# GitHub
service = IssueSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")
)

# Jira
service = IssueSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="jira",
    repo="PROJ",  # Jira project key
    token=os.getenv("JIRA_TOKEN"),
    base_url="https://company.atlassian.net",
    email=os.getenv("JIRA_EMAIL")
)

result = await service.sync()

Supported Providers

Provider  Config             Issue Pattern
--------  -----------------  -----------------
GitHub    provider="github"  #123, GH-123
GitLab    provider="gitlab"  #123, !123 (MRs)
Jira      provider="jira"    PROJ-123
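The issue patterns listed for each provider can be approximated with regular expressions. This is a rough sketch of how references might be pulled out of a commit message; the regexes and the `extract_issue_refs` helper are assumptions made for illustration and may differ from what IssueSyncService actually does.

```python
import re

# Hypothetical patterns approximating the "Issue Pattern" column above.
PATTERNS = {
    "github": re.compile(r"(?<!\w)(?:#|GH-)\d+\b"),    # #123, GH-123
    "gitlab": re.compile(r"(?<!\w)[#!]\d+\b"),         # #123, !123 (MRs)
    "jira": re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),     # PROJ-123
}


def extract_issue_refs(message: str, provider: str) -> list:
    """Return unique issue references in order of first appearance."""
    matches = PATTERNS[provider].findall(message)
    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(matches))


print(extract_issue_refs("Fix #123 and GH-45, again #123", "github"))
print(extract_issue_refs("PROJ-123: handle null input", "jira"))
```

The Jira pattern here accepts any uppercase project key; restricting it to the configured key (for example `PROJ`) would reduce false positives.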

Available Issue Tags

Tag Name      Description       Example Value
------------  ----------------  -------------------------
issue-id      Issue identifier  PROJ-123, #456
issue-type    Type of issue     bug, feature, refactor
issue-status  Current status    open, closed, in_progress
issue-label   Issue labels      critical, tech-debt

APM/Sentry Integration

SentrySyncService

Syncs error data from Sentry to identify error-prone code.

import os

from src.services.external_context import SentrySyncService

service = SentrySyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    org_slug="my-organization",
    project_slug="my-project",
    token=os.getenv("SENTRY_AUTH_TOKEN")
)

result = await service.sync(days_back=30)

Available Sentry Tags

Tag Name         Description      Example Value
---------------  ---------------  ---------------------
sentry-issue     Sentry issue ID  SENTRY-12345
error-frequency  Errors per day   150
error-level      Severity level   error, fatal, warning
error-type       Exception type   NullPointerException

Using the Orchestrator

Python API

from src.services.external_context import ExternalContextOrchestrator

orchestrator = ExternalContextOrchestrator(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    repo_path="."
)

# Sync all sources
result = await orchestrator.sync_all(
    git_config={"since_days": 30, "include_blame": True},
    issue_config={"provider": "github", "repo": "owner/repo", "token": token},
    sentry_config={"org_slug": "org", "project_slug": "proj", "token": sentry_token},
    parallel=True  # Run syncs in parallel
)

# Sync individual sources
git_result = await orchestrator.sync_git(since_days=30)
issue_result = await orchestrator.sync_issues(provider="github", repo="owner/repo", token=token)
sentry_result = await orchestrator.sync_sentry(org_slug="org", project_slug="proj", token=sentry_token)
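The `parallel=True` option suggests the three syncs run concurrently. One common way to do that in Python is `asyncio.gather`. The sketch below illustrates the idea with stand-in coroutines; `fake_sync` and this `sync_all` are hypothetical and say nothing about how the orchestrator is actually implemented.

```python
import asyncio


async def fake_sync(name: str, delay: float) -> dict:
    """Hypothetical stand-in for one sync service call."""
    await asyncio.sleep(delay)  # simulates network or disk work
    return {"source": name, "status": "ok"}


async def sync_all(parallel: bool = True) -> list:
    jobs = [("git", 0.02), ("issues", 0.01), ("sentry", 0.01)]
    if parallel:
        # gather() runs the coroutines concurrently and returns results
        # in the order the coroutines were passed in.
        return list(await asyncio.gather(*(fake_sync(n, d) for n, d in jobs)))
    # Sequential fallback: one source at a time.
    return [await fake_sync(n, d) for n, d in jobs]


results = asyncio.run(sync_all(parallel=True))
print([r["source"] for r in results])  # ['git', 'issues', 'sentry']
```

Because the sources are independent, running them concurrently mostly saves waiting time on network calls; the results are the same either way.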

CLI Interface

# Sync git history
python -m src.services.external_context.orchestrator \
    --repo-path /path/to/repo \
    --git --git-days 30 --git-blame

# Sync GitHub issues
python -m src.services.external_context.orchestrator \
    --repo-path /path/to/repo \
    --issues --issue-provider github --issue-repo owner/repo

# Sync Sentry errors
python -m src.services.external_context.orchestrator \
    --repo-path /path/to/repo \
    --sentry --sentry-org my-org --sentry-project my-project

# Sync all
python -m src.services.external_context.orchestrator \
    --repo-path /path/to/repo \
    --all --parallel

Query Examples

Using CPGQueryService

from src.services.cpg_query_service import CPGQueryService

cpg = CPGQueryService(duckdb_conn)

# Find methods by author
methods = cpg.get_methods_by_author("developer@example.com", limit=50)

# List all contributors
authors = cpg.get_git_authors(limit=20)

# Find methods linked to an issue
methods = cpg.get_methods_by_issue("PROJ-123")

# Find error-prone methods
errors = cpg.get_error_prone_methods(min_frequency=10)

# Find high-churn code
hotspots = cpg.get_git_hotspots(min_churn=5)

# Find risky code by author (security-risk + author)
risky = cpg.get_risky_code_by_author("developer@example.com")

# Find bus factor candidates
bus_factor = cpg.get_bus_factor_candidates(max_authors=1)

# Get external context statistics
stats = cpg.get_external_context_stats()

Natural Language Queries

The onboarding workflow supports natural language queries for external context:

  • English: “Who wrote the authentication module?”, “What issues are linked to the parser?”, “Which methods have production errors?”
  • Russian: “Кто написал модуль аутентификации?”, “Какие задачи связаны с парсером?”, “Какие методы вызывают ошибки?”

API Reference

CPGQueryService Methods

Method                                         Description             Parameters
---------------------------------------------  ----------------------  ------------------------------
get_methods_by_author(email, limit)            Find methods by author  email: str, limit: int
get_git_authors(limit)                         List all contributors   limit: int
get_methods_by_issue(issue_id, limit)          Find methods by issue   issue_id: str, limit: int
get_error_prone_methods(min_frequency, limit)  Find error hotspots     min_frequency: int, limit: int
get_git_hotspots(min_churn, limit)             Find high-churn code    min_churn: int, limit: int
get_risky_code_by_author(email, limit)         Author + security risk  email: str, limit: int
get_bus_factor_candidates(max_authors, limit)  Single-owner code       max_authors: int, limit: int
get_external_context_stats()                   Tag statistics          -

HybridExternalSearch Methods

Method                               Description
-----------------------------------  -----------------------------------------
find_methods_by_author(email)        Hybrid search for author’s code
find_methods_by_issue(issue_id)      Hybrid search for issue-linked code
find_error_prone_methods()           Hybrid search for error hotspots
find_code_for_incident(description)  Semantic search for incident-related code

Troubleshooting

No Tags Created

  1. Check that the CPG database has methods in nodes_method
  2. Verify file paths in CPG match repository structure
  3. Run sync with include_blame=True for git
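Step 2 can be checked mechanically by testing whether each filename recorded in the CPG resolves to a real file under the repository root. The helper below is a hypothetical sketch, not part of the project; in practice the filename list would come from something like `SELECT DISTINCT filename FROM nodes_method`, and the demo uses a temporary directory in place of a real repository.

```python
import tempfile
from pathlib import Path


def unmatched_files(cpg_filenames: list, repo_root: str) -> list:
    """Return CPG filenames that do not resolve to a file under repo_root."""
    root = Path(repo_root)
    return [name for name in cpg_filenames if not (root / name).is_file()]


# Demo with a throwaway "repository" containing a single file.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "app.py").write_text("def main(): ...\n")
    # The CPG recorded one path relative to the root and one with a
    # src/ prefix that does not exist in this repository layout.
    missing = unmatched_files(["app.py", "src/app.py"], repo)

print(missing)  # ['src/app.py']
```

If most filenames come back unmatched, the CPG was probably built from a different root directory than the `repo_path` passed to the sync service.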

GitHub Rate Limiting

# Use authenticated requests
service = IssueSyncService(
    duckdb_conn=conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")  # Required for higher rate limits
)
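If requests are still throttled with a token, GitHub's REST API reports the remaining quota in the `x-ratelimit-remaining` response header and the reset time (UTC epoch seconds) in `x-ratelimit-reset`. The helper below is a hypothetical sketch for deciding how long to wait; it is not part of IssueSyncService and assumes the headers arrive as a plain dict with lowercase keys.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, or 0.0 if quota remains."""
    now = time.time() if now is None else now
    # Treat a missing header as "quota available" rather than blocking.
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("x-ratelimit-reset", "0"))
    return max(0.0, reset_at - now)


exhausted = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1060"}
print(seconds_until_reset(exhausted, now=1000.0))  # 60.0
```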

Sentry API Errors

  1. Verify org_slug and project_slug are correct
  2. Check token has project:read and event:read scopes
  3. Try with --sentry-days 7 for a smaller data set

Next Steps