External Context Integration Guide
Integration guide for linking CPG code entities with external systems (Git, Issue Trackers, APM).
Table of Contents
- Overview
- Architecture
- Quick Setup
- Git Integration
  - GitSyncService
  - Available Tags
  - Git Queries
- Issue Tracker Integration
  - IssueSyncService
  - Supported Providers
  - Available Tags
- APM/Sentry Integration
  - SentrySyncService
  - Available Tags
- Using the Orchestrator
  - Python API
  - CLI Interface
- Query Examples
- API Reference
- Troubleshooting
Overview
External Context Integration allows you to link code entities (methods, functions, classes) in your Code Property Graph (CPG) with metadata from external systems:
- Git: Author information, commit history, code churn
- Issue Trackers: Jira, GitHub Issues, GitLab Issues
- APM/Error Tracking: Sentry error data, frequency, severity
This enables powerful queries like:
- “Who wrote this code?”
- “What issues are linked to this function?”
- “Which methods have the most production errors?”
- “What code changes most frequently?”
Architecture
┌─────────────────────────────────────────────────────────────┐
│                      External Systems                       │
├─────────────┬─────────────────────┬─────────────────────────┤
│     Git     │   Issue Trackers    │         Sentry          │
│  (commits)  │  (Jira/GitHub/GL)   │        (errors)         │
└──────┬──────┴──────────┬──────────┴────────────┬────────────┘
       │                 │                       │
       ▼                 ▼                       ▼
┌─────────────────────────────────────────────────────────────┐
│                 ExternalContextOrchestrator                 │
│ ┌──────────────┐  ┌────────────────┐  ┌──────────────────┐  │
│ │GitSyncService│  │IssueSyncService│  │SentrySyncService │  │
│ └──────────────┘  └────────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────┘
         │                                  │
         ▼                                  ▼
┌──────────────────┐             ┌──────────────────────┐
│    PostgreSQL    │             │       ChromaDB       │
│  (raw metadata)  │             │  (semantic search)   │
└────────┬─────────┘             └──────────┬───────────┘
         │                                  │
         └────────────────┬─────────────────┘
                          ▼
                 ┌─────────────────┐
                 │   DuckDB CPG    │
                 │   (nodes_tag)   │
                 └─────────────────┘
Quick Setup
1. Install Dependencies
pip install -r requirements.txt
2. Configure Environment Variables
# Git (no additional config needed - uses local git)
# GitHub Issues
export GITHUB_TOKEN="ghp_xxxxxxxxxxxx"
# GitLab Issues
export GITLAB_TOKEN="glpat-xxxxxxxxxxxx"
# Jira
export JIRA_TOKEN="your_jira_api_token"
export JIRA_EMAIL="your@email.com"
# Sentry
export SENTRY_AUTH_TOKEN="your_sentry_token"
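Before running a sync, it can help to confirm which of the variables above are actually set. The following is a minimal sketch; the helper name `configured_integrations` and the grouping of variables per integration are assumptions for illustration, not part of the project.

```python
import os

# Variables named in this guide, grouped per integration (the grouping is an assumption).
REQUIRED_VARS = {
    "github": ["GITHUB_TOKEN"],
    "gitlab": ["GITLAB_TOKEN"],
    "jira": ["JIRA_TOKEN", "JIRA_EMAIL"],
    "sentry": ["SENTRY_AUTH_TOKEN"],
}


def configured_integrations(env=None):
    """Return the integrations whose variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, variables in REQUIRED_VARS.items()
        if all(env.get(var) for var in variables)
    ]


if __name__ == "__main__":
    print("Configured integrations:", configured_integrations())
```

Git is not listed because it uses the local repository and needs no credentials.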
3. Run Initial Sync
import os
from src.services.external_context import ExternalContextOrchestrator
# Initialize orchestrator
orchestrator = ExternalContextOrchestrator(
    duckdb_conn=your_duckdb_connection,
    pg_conn=your_postgres_connection,  # optional
    repo_path="/path/to/your/repo"
)
# Sync all sources (call from inside an async function, or wrap with asyncio.run)
result = await orchestrator.sync_all(
    git_config={"since_days": 90, "include_blame": True},
    issue_config={"provider": "github", "repo": "owner/repo", "token": os.getenv("GITHUB_TOKEN")},
    sentry_config={"org_slug": "my-org", "project_slug": "my-project", "token": os.getenv("SENTRY_AUTH_TOKEN")}
)
print(result)
Git Integration
GitSyncService
Syncs git history metadata to CPG tags.
from src.services.external_context import GitSyncService
service = GitSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,  # optional, for storing raw data
    repo_path="/path/to/repo"
)
# Sync last 30 days of commits
result = await service.sync(since_days=30, include_blame=True)
print(f"Synced {result.items_synced} commits, created {result.tags_created} tags")
Available Git Tags
| Tag Name | Description | Example Value |
|---|---|---|
| git-commit | SHA of last commit modifying this method | a1b2c3d4... |
| git-author | Email of last modifier | dev@company.com |
| git-branch | Branch where code originated | feature/PROJ-123 |
| git-blame-count | Number of unique authors | 3 |
| git-churn | Number of modifications | 15 |
| git-last-modified | Timestamp of last change | 2025-01-09T10:30:00Z |
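To illustrate how the tags above relate to commit history, here is a minimal sketch that derives them from a list of commits touching one method. The `Commit` record and the `derive_git_tags` helper are hypothetical; GitSyncService may compute these values differently (for example, `git-blame-count` may come from `git blame` rather than from commit authorship). Values are kept as strings because the queries in this guide cast tag values with `CAST(t.value AS INT)`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Commit:
    """Hypothetical record for one commit that modified a method."""
    sha: str
    author_email: str
    timestamp: datetime  # assumed to be in UTC


def derive_git_tags(commits):
    """Map a method's commits to the tag names listed in the table above."""
    if not commits:
        return {}
    latest = max(commits, key=lambda c: c.timestamp)
    return {
        "git-commit": latest.sha,
        "git-author": latest.author_email,
        # Unique authors, approximated here from commit authorship.
        "git-blame-count": str(len({c.author_email for c in commits})),
        "git-churn": str(len(commits)),
        "git-last-modified": latest.timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


history = [
    Commit("a1b2c3d4", "dev@company.com", datetime(2025, 1, 9, 10, 30, tzinfo=timezone.utc)),
    Commit("e5f6a7b8", "other@company.com", datetime(2024, 12, 1, 8, 0, tzinfo=timezone.utc)),
]
tags = derive_git_tags(history)
print(tags)
```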
Git Queries
-- Find all methods modified by a specific author
SELECT m.full_name, m.filename, t.value as author
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag t ON e.dst = t.id
WHERE t.name = 'git-author' AND t.value = 'developer@example.com';
-- Find high-churn code (methods changed > 10 times)
SELECT m.full_name, CAST(t.value AS INT) as churn_count
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag t ON e.dst = t.id
WHERE t.name = 'git-churn' AND CAST(t.value AS INT) > 10
ORDER BY churn_count DESC;
-- Find bus factor candidates (methods with only 1 author)
SELECT m.full_name, t_author.value as sole_author
FROM nodes_method m
JOIN edges_tagged_by e1 ON m.id = e1.src
JOIN nodes_tag t_blame ON e1.dst = t_blame.id
JOIN edges_tagged_by e2 ON m.id = e2.src
JOIN nodes_tag t_author ON e2.dst = t_author.id
WHERE t_blame.name = 'git-blame-count' AND t_blame.value = '1'
AND t_author.name = 'git-author';
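The queries above can be tried on a toy dataset. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB and assumes a reduced schema with only the columns these queries reference; the real CPG tables likely have more columns, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_method (id INTEGER PRIMARY KEY, full_name TEXT, filename TEXT);
CREATE TABLE nodes_tag (id INTEGER PRIMARY KEY, name TEXT, value TEXT);
CREATE TABLE edges_tagged_by (src INTEGER, dst INTEGER);

INSERT INTO nodes_method VALUES (1, 'auth.login', 'auth.py'), (2, 'parser.parse', 'parser.py');
INSERT INTO nodes_tag VALUES
    (10, 'git-churn', '15'), (11, 'git-churn', '3'),
    (12, 'git-author', 'developer@example.com');
INSERT INTO edges_tagged_by VALUES (1, 10), (2, 11), (1, 12);
""")

# High-churn query from above: methods changed more than 10 times.
high_churn = conn.execute("""
    SELECT m.full_name, CAST(t.value AS INT) AS churn_count
    FROM nodes_method m
    JOIN edges_tagged_by e ON m.id = e.src
    JOIN nodes_tag t ON e.dst = t.id
    WHERE t.name = 'git-churn' AND CAST(t.value AS INT) > 10
    ORDER BY churn_count DESC
""").fetchall()

print(high_churn)  # [('auth.login', 15)]
```

Tag values are stored as text in this sketch, which is why the churn comparison needs the cast.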
Issue Tracker Integration
IssueSyncService
Links issues to code via commit message references.
import os
from src.services.external_context import IssueSyncService
# GitHub
service = IssueSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")
)
# Jira (alternative configuration)
service = IssueSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="jira",
    repo="PROJ",  # Jira project key
    token=os.getenv("JIRA_TOKEN"),
    base_url="https://company.atlassian.net",
    email=os.getenv("JIRA_EMAIL")
)
result = await service.sync()
Supported Providers
| Provider | Config | Issue Pattern |
|---|---|---|
| GitHub | provider="github" | #123, GH-123 |
| GitLab | provider="gitlab" | #123, !123 (MRs) |
| Jira | provider="jira" | PROJ-123 |
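As an illustration of how references in commit messages might be matched against the patterns above, here is a minimal regex sketch. The regular expressions and the `extract_issue_refs` helper are assumptions; the patterns IssueSyncService actually uses are not shown in this guide.

```python
import re

# Hypothetical patterns approximating the "Issue Pattern" column above.
PATTERNS = {
    "github": re.compile(r"(?<![\w-])(?:#|GH-)(\d+)\b"),
    "gitlab": re.compile(r"(?<![\w-])([#!]\d+)\b"),
    "jira": re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b"),
}


def extract_issue_refs(message, provider):
    """Return issue references found in a commit message, in order, without duplicates."""
    seen = []
    for match in PATTERNS[provider].finditer(message):
        ref = match.group(1)
        if provider == "github":
            ref = f"#{ref}"  # normalize GH-123 and #123 to the same form
        if ref not in seen:
            seen.append(ref)
    return seen


print(extract_issue_refs("Fix login crash (#123), follow-up to GH-45", "github"))
print(extract_issue_refs("PROJ-123: handle empty tokens, see PROJ-7", "jira"))
print(extract_issue_refs("Closes #12, merged via !34", "gitlab"))
```

The lookbehind keeps fragments such as `abc#123` from being treated as issue references; whether the real service is this strict is unknown.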
Available Issue Tags
| Tag Name | Description | Example Value |
|---|---|---|
| issue-id | Issue identifier | PROJ-123, #456 |
| issue-type | Type of issue | bug, feature, refactor |
| issue-status | Current status | open, closed, in_progress |
| issue-label | Issue labels | critical, tech-debt |
APM/Sentry Integration
SentrySyncService
Syncs error data from Sentry to identify error-prone code.
import os
from src.services.external_context import SentrySyncService
service = SentrySyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    org_slug="my-organization",
    project_slug="my-project",
    token=os.getenv("SENTRY_AUTH_TOKEN")
)
result = await service.sync(days_back=30)
Available Sentry Tags
| Tag Name | Description | Example Value |
|---|---|---|
| sentry-issue | Sentry issue ID | SENTRY-12345 |
| error-frequency | Errors per day | 150 |
| error-level | Severity level | error, fatal, warning |
| error-type | Exception type | NullPointerException |
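The following sketch shows one way the tag values above could be derived from raw error events. The `ErrorEvent` record, the severity ordering, and the averaging over the sync window are assumptions for illustration, not the documented behavior of SentrySyncService.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed severity ordering, lowest to highest.
LEVEL_ORDER = ["debug", "info", "warning", "error", "fatal"]


@dataclass
class ErrorEvent:
    """Hypothetical, simplified error event attributed to one method."""
    issue_id: str
    level: str
    exception_type: str


def derive_error_tags(events, days_back):
    """Summarize a method's events as tag values (strings, like other tags)."""
    if not events or days_back <= 0:
        return {}
    most_common_type, _ = Counter(e.exception_type for e in events).most_common(1)[0]
    worst_level = max((e.level for e in events), key=LEVEL_ORDER.index)
    return {
        "sentry-issue": events[0].issue_id,
        # Average events per day over the sync window.
        "error-frequency": str(round(len(events) / days_back)),
        "error-level": worst_level,
        "error-type": most_common_type,
    }


events = [ErrorEvent("SENTRY-12345", "error", "NullPointerException")] * 300
events.append(ErrorEvent("SENTRY-12345", "fatal", "IllegalStateException"))
tags = derive_error_tags(events, days_back=30)
print(tags)
```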
Using the Orchestrator
Python API
import os
from src.services.external_context import ExternalContextOrchestrator
token = os.getenv("GITHUB_TOKEN")
sentry_token = os.getenv("SENTRY_AUTH_TOKEN")
orchestrator = ExternalContextOrchestrator(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    repo_path="."
)
# Sync all sources
result = await orchestrator.sync_all(
    git_config={"since_days": 30, "include_blame": True},
    issue_config={"provider": "github", "repo": "owner/repo", "token": token},
    sentry_config={"org_slug": "org", "project_slug": "proj", "token": sentry_token},
    parallel=True  # Run syncs in parallel
)
# Sync individual sources
git_result = await orchestrator.sync_git(since_days=30)
issue_result = await orchestrator.sync_issues(provider="github", repo="owner/repo", token=token)
sentry_result = await orchestrator.sync_sentry(org_slug="org", project_slug="proj", token=sentry_token)
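For intuition about what `parallel=True` implies, the sketch below runs three stand-in coroutines concurrently with `asyncio.gather`. It is an assumption that the orchestrator works along these lines; the `fake_sync` coroutine and its result dictionaries are hypothetical and only mimic the `items_synced` field shown earlier.

```python
import asyncio
import time


async def fake_sync(source, delay, items):
    """Hypothetical stand-in for one sync call; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return {"source": source, "items_synced": items}


async def sync_all(parallel=True):
    jobs = [("git", 0.05, 120), ("issues", 0.05, 35), ("sentry", 0.05, 8)]
    if parallel:
        # All three coroutines wait concurrently, so total time is about one delay.
        return await asyncio.gather(*(fake_sync(*job) for job in jobs))
    # Sequential fallback: total time is about the sum of the delays.
    return [await fake_sync(*job) for job in jobs]


start = time.perf_counter()
results = asyncio.run(sync_all(parallel=True))
elapsed = time.perf_counter() - start
print([r["source"] for r in results], f"{elapsed:.2f}s")
```

`asyncio.gather` returns results in the order the coroutines were passed, so the per-source results stay easy to tell apart even when the syncs finish at different times.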
CLI Interface
# Sync git history
python -m src.services.external_context.orchestrator \
--repo-path /path/to/repo \
--git --git-days 30 --git-blame
# Sync GitHub issues
python -m src.services.external_context.orchestrator \
--repo-path /path/to/repo \
--issues --issue-provider github --issue-repo owner/repo
# Sync Sentry errors
python -m src.services.external_context.orchestrator \
--repo-path /path/to/repo \
--sentry --sentry-org my-org --sentry-project my-project
# Sync all
python -m src.services.external_context.orchestrator \
--repo-path /path/to/repo \
--all --parallel
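The flags used above suggest an argument parser along the following lines. This is a hypothetical reconstruction based only on the flags that appear in this guide (including `--sentry-days` from the Troubleshooting section); the defaults and types are assumptions, and the real CLI may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(prog="orchestrator")
    parser.add_argument("--repo-path", required=True)
    parser.add_argument("--git", action="store_true")
    parser.add_argument("--git-days", type=int, default=30)  # default is assumed
    parser.add_argument("--git-blame", action="store_true")
    parser.add_argument("--issues", action="store_true")
    parser.add_argument("--issue-provider", choices=["github", "gitlab", "jira"])
    parser.add_argument("--issue-repo")
    parser.add_argument("--sentry", action="store_true")
    parser.add_argument("--sentry-org")
    parser.add_argument("--sentry-project")
    parser.add_argument("--sentry-days", type=int, default=30)  # default is assumed
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--parallel", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--repo-path", "/path/to/repo", "--git", "--git-days", "30", "--git-blame"]
)
print(args.git, args.git_days, args.git_blame)
```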
Query Examples
Using CPGQueryService
from src.services.cpg_query_service import CPGQueryService
cpg = CPGQueryService(duckdb_conn)
# Find methods by author
methods = cpg.get_methods_by_author("developer@example.com", limit=50)
# List all contributors
authors = cpg.get_git_authors(limit=20)
# Find methods linked to an issue
methods = cpg.get_methods_by_issue("PROJ-123")
# Find error-prone methods
errors = cpg.get_error_prone_methods(min_frequency=10)
# Find high-churn code
hotspots = cpg.get_git_hotspots(min_churn=5)
# Find risky code by author (security-risk + author)
risky = cpg.get_risky_code_by_author("developer@example.com")
# Find bus factor candidates
bus_factor = cpg.get_bus_factor_candidates(max_authors=1)
# Get external context statistics
stats = cpg.get_external_context_stats()
Natural Language Queries
The onboarding workflow supports natural language queries for external context:
- English: “Who wrote the authentication module?”, “What issues are linked to the parser?”, “Which methods have production errors?”
- Russian: “Кто написал модуль аутентификации?”, “Какие задачи связаны с парсером?”, “Какие методы вызывают ошибки?”
API Reference
CPGQueryService Methods
| Method | Description | Parameters |
|---|---|---|
| get_methods_by_author(email, limit) | Find methods by author | email: str, limit: int |
| get_git_authors(limit) | List all contributors | limit: int |
| get_methods_by_issue(issue_id, limit) | Find methods by issue | issue_id: str, limit: int |
| get_error_prone_methods(min_frequency, limit) | Find error hotspots | min_frequency: int, limit: int |
| get_git_hotspots(min_churn, limit) | Find high-churn code | min_churn: int, limit: int |
| get_risky_code_by_author(email, limit) | Author + security risk | email: str, limit: int |
| get_bus_factor_candidates(max_authors, limit) | Single-owner code | max_authors: int, limit: int |
| get_external_context_stats() | Tag statistics | - |
HybridExternalSearch Methods
| Method | Description |
|---|---|
| find_methods_by_author(email) | Hybrid search for author’s code |
| find_methods_by_issue(issue_id) | Hybrid search for issue-linked code |
| find_error_prone_methods() | Hybrid search for error hotspots |
| find_code_for_incident(description) | Semantic search for incident-related code |
Troubleshooting
No Tags Created
- Check that CPG database has methods in nodes_method
- Verify file paths in CPG match repository structure
- Run sync with include_blame=True for git
GitHub Rate Limiting
# Use authenticated requests
service = IssueSyncService(
    duckdb_conn=conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")  # Required for higher rate limits
)
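When requests are still throttled, GitHub's REST API reports the remaining quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The sketch below computes how long to wait from those headers; the `seconds_until_reset` helper is hypothetical and is not part of IssueSyncService.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to pause before the next GitHub request (0 if quota remains)."""
    # Header names are case-insensitive; normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    remaining = int(normalized.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_epoch = int(normalized.get("x-ratelimit-reset", 0))  # UTC epoch seconds
    now = time.time() if now is None else now
    return max(0.0, float(reset_epoch) - now)


wait = seconds_until_reset(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"},
    now=1700000000,
)
print(wait)  # 60.0
```

A caller could sleep for the returned number of seconds before retrying the request.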
Sentry API Errors
- Verify org_slug and project_slug are correct
- Check token has project:read and event:read scopes
- Try with --sentry-days 7 for smaller data set
Next Steps
- SQL Query Cookbook - More SQL examples
- Onboarding Scenario - Using external context in queries
- Architecture - System architecture details