External Context Integration Guide

Integration guide for linking CPG code entities with external systems (Git, Issue Trackers, APM).

Overview

External Context Integration allows you to link code entities (methods, functions, classes) in your Code Property Graph (CPG) with metadata from external systems:

  • Git: Author information, commit history, code churn, blame data
  • Issue Trackers: Jira, GitHub Issues, GitLab Issues
  • APM/Error Tracking: Sentry error data, frequency buckets, severity

This enables queries like:

  • “Who wrote this code?”
  • “What issues are linked to this function?”
  • “Which methods have the most production errors?”
  • “What code changes most frequently?”

Architecture

+-----------------------------------------------------------------+
|                    External Systems                              |
+---------------+---------------------+---------------------------+
|    Git        |   Issue Trackers    |        Sentry             |
|  (commits)    |  (Jira/GitHub/GL)   |       (errors)            |
+-------+-------+----------+----------+-----------+---------------+
        |                  |                      |
        v                  v                      v
+-----------------------------------------------------------------+
|              ExternalContextOrchestrator                         |
|  +---------------+ +-----------------+ +---------------------+  |
|  |GitSyncService | |IssueSyncService | |SentrySyncService    |  |
|  +---------------+ +-----------------+ +---------------------+  |
+-----------------------------------------------------------------+
        |                                          |
        v                                          v
+--------------------+                  +------------------------+
|   PostgreSQL       |                  |      DuckDB CPG        |
| (raw metadata)     |                  |   (nodes_tag_v2,       |
| - external_context |                  |    edges_tagged_by,    |
| - file_commit_hist |                  |    nodes_method)       |
| - git_authors      |                  +------------------------+
| - runtime_metrics  |
+--------------------+

Data flows from external systems through the sync services into two stores:

  • PostgreSQL (optional): stores raw metadata in tables external_context, file_commit_history, git_authors, runtime_metrics
  • DuckDB CPG: stores tags in nodes_tag_v2 and links them to methods via edges_tagged_by

All tags are stored in nodes_tag_v2 (not nodes_tag). Method-tag links go through edges_tagged_by.
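
The method-to-tag model described above can be pictured with a small stand-in. The sketch below uses Python's built-in sqlite3 in place of DuckDB so that it runs without a CPG. The column names (id, full_name, filename, name, value, src, dst) are taken from the queries in this guide; the column types, the sample rows, and the helper tags_for_method are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the CPG; a real CPG would be a DuckDB database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_method (id INTEGER PRIMARY KEY, full_name TEXT, filename TEXT);
CREATE TABLE nodes_tag_v2 (id INTEGER PRIMARY KEY, name TEXT, value TEXT);
CREATE TABLE edges_tagged_by (src INTEGER, dst INTEGER);  -- src = method id, dst = tag id
""")

# Sample data (invented): one method carrying two tags.
conn.execute("INSERT INTO nodes_method VALUES (1, 'auth.login', 'src/auth.py')")
conn.execute("INSERT INTO nodes_tag_v2 VALUES (10, 'git-author', 'dev@company.com')")
conn.execute("INSERT INTO nodes_tag_v2 VALUES (11, 'git-churn', '15')")
conn.executemany("INSERT INTO edges_tagged_by VALUES (?, ?)", [(1, 10), (1, 11)])


def tags_for_method(connection, full_name):
    """Return {tag name: tag value} for one method by following edges_tagged_by."""
    rows = connection.execute(
        """
        SELECT t.name, t.value
        FROM nodes_method m
        JOIN edges_tagged_by e ON m.id = e.src
        JOIN nodes_tag_v2 t ON e.dst = t.id
        WHERE m.full_name = ?
        """,
        (full_name,),
    ).fetchall()
    return dict(rows)


print(tags_for_method(conn, "auth.login"))
```

Tag values are stored as text in this sketch, which matches the way the queries in this guide cast git-churn to an integer before comparing it.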

Quick Setup

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment Variables

# Git (no additional config needed - uses local git)

# GitHub Issues
export GITHUB_TOKEN="ghp_xxxxxxxxxxxx"

# GitLab Issues
export GITLAB_TOKEN="glpat-xxxxxxxxxxxx"

# Jira
export JIRA_TOKEN="your_jira_api_token"

# Sentry
export SENTRY_AUTH_TOKEN="your_sentry_token"
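
Before running a sync it can help to confirm that the variables above are actually set. The following is a minimal sketch; the variable names come from the list above, while the mapping TOKEN_VARS and the helper missing_tokens are hypothetical and not part of the package.

```python
import os

# Token variable expected for each integration, as listed above.
# Git is omitted because it needs no token.
TOKEN_VARS = {
    "github": "GITHUB_TOKEN",
    "gitlab": "GITLAB_TOKEN",
    "jira": "JIRA_TOKEN",
    "sentry": "SENTRY_AUTH_TOKEN",
}


def missing_tokens(integrations, environ=None):
    """Return the variable names that are unset or empty for the given integrations."""
    env = os.environ if environ is None else environ
    return [TOKEN_VARS[name] for name in integrations if not env.get(TOKEN_VARS[name])]


# Example with an explicit mapping instead of the real environment.
print(missing_tokens(["github", "sentry"], {"GITHUB_TOKEN": "ghp_example"}))
# prints ['SENTRY_AUTH_TOKEN']
```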

3. Run Initial Sync

import os

from src.services.external_context import ExternalContextOrchestrator

orchestrator = ExternalContextOrchestrator(
    duckdb_conn=your_duckdb_connection,
    pg_conn=your_postgres_connection,  # optional
    repo_path="/path/to/your/repo"
)

result = orchestrator.sync_all(
    git_config={"since_days": 90, "use_blame": True},
    issue_config={
        "provider": "github",
        "repo": "owner/repo",
        "token": os.getenv("GITHUB_TOKEN"),
    },
    sentry_config={
        "org_slug": "my-org",
        "project_slug": "my-project",
        "token": os.getenv("SENTRY_AUTH_TOKEN"),
    },
)

print(f"Total synced: {result.total_items_synced}, tags: {result.total_tags_created}")

Git Integration

GitSyncService

Syncs git history metadata to CPG tags.

from src.services.external_context import GitSyncService

service = GitSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,        # optional, for storing raw data
    repo_path="/path/to/repo",
    max_commits=1000,        # default
    since_days=90            # default
)

# Sync with blame analysis
result = service.sync(use_blame=True)
print(f"Synced {result.items_synced} commits, created {result.tags_created} tags")

# Sync only specific files, without blame
result = service.sync(files=["src/main.py", "src/auth.py"], use_blame=False)

Constructor parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| duckdb_conn | connection | required | DuckDB connection |
| pg_conn | connection | None | PostgreSQL connection (optional) |
| repo_path | str | "." | Path to git repository |
| max_commits | int | 1000 | Maximum commits to process |
| since_days | int | 90 | How many days back to look |

Sync parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| files | List[str] | None | Limit sync to specific files |
| use_blame | bool | True | Run git blame analysis |

Available Git Tags

| Tag Name | Description | Example Value |
|---|---|---|
| git-author | Email of last modifier | dev@company.com |
| git-commit | SHA of last commit modifying this method | a1b2c3d4... |
| git-branch | Branch where code originated | feature/PROJ-123 |
| git-blame-count | Number of unique authors | 3 |
| git-churn | Number of modifications | 15 |
| git-last-modified | Timestamp of last change | 2025-01-09T10:30:00Z |
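
To make the blame-based tags concrete, here is a sketch of how such values could be derived from the blame lines of a single method. This guide does not show how GitSyncService computes them, so the logic below is an assumption: BlameLine is a hypothetical, trimmed-down stand-in for GitBlameEntry, timestamps are assumed to be UTC, and values are returned as strings because the queries in this guide compare and cast tag values as text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class BlameLine:
    """Hypothetical, trimmed-down stand-in for GitBlameEntry."""
    commit_sha: str
    author_email: str
    timestamp: datetime


def blame_tags(lines: List[BlameLine]) -> Dict[str, str]:
    """Derive blame-based tag values for the lines of one method."""
    if not lines:
        return {}
    # The most recently changed line decides author, commit and timestamp.
    latest = max(lines, key=lambda line: line.timestamp)
    return {
        "git-author": latest.author_email,
        "git-commit": latest.commit_sha,
        "git-blame-count": str(len({line.author_email for line in lines})),
        "git-last-modified": latest.timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


example = [
    BlameLine("a1b2c3d4", "dev@company.com",
              datetime(2025, 1, 9, 10, 30, tzinfo=timezone.utc)),
    BlameLine("e5f6a7b8", "other@company.com",
              datetime(2024, 12, 1, 8, 0, tzinfo=timezone.utc)),
]
print(blame_tags(example))
```

git-churn and git-branch are left out because they depend on commit history rather than on blame output.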

Git Queries

All queries use nodes_tag_v2 (the current tag table).

-- Find all methods modified by a specific author
SELECT m.full_name, m.filename, t.value AS author
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'git-author' AND t.value = 'developer@example.com';

-- Find high-churn code (methods changed > 10 times)
SELECT m.full_name, CAST(t.value AS INT) AS churn_count
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'git-churn' AND CAST(t.value AS INT) > 10
ORDER BY churn_count DESC;

-- Find bus factor candidates (methods with only 1 author)
SELECT m.full_name, t_author.value AS sole_author
FROM nodes_method m
JOIN edges_tagged_by e1 ON m.id = e1.src
JOIN nodes_tag_v2 t_blame ON e1.dst = t_blame.id
JOIN edges_tagged_by e2 ON m.id = e2.src
JOIN nodes_tag_v2 t_author ON e2.dst = t_author.id
WHERE t_blame.name = 'git-blame-count' AND t_blame.value = '1'
  AND t_author.name = 'git-author';

-- Find recently modified methods (last 7 days)
SELECT m.full_name, m.filename, t.value AS last_modified
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'git-last-modified'
  AND CAST(t.value AS TIMESTAMP) > NOW() - INTERVAL '7 days'
ORDER BY t.value DESC;

Issue Tracker Integration

IssueSyncService

Links issues to code via commit message references.

import os

from src.services.external_context import IssueSyncService

# GitHub
service = IssueSyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN"),
    max_issues=500  # default
)

# GitLab
service = IssueSyncService(
    duckdb_conn=conn,
    provider="gitlab",
    repo="owner/repo",
    base_url="https://gitlab.company.com",
    token=os.getenv("GITLAB_TOKEN")
)

# Jira
service = IssueSyncService(
    duckdb_conn=conn,
    provider="jira",
    project_key="PROJ",
    base_url="https://company.atlassian.net",
    token=os.getenv("JIRA_TOKEN")
)

result = service.sync(since_days=90, link_via_commits=True)

Constructor parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| duckdb_conn | connection | required | DuckDB connection |
| pg_conn | connection | None | PostgreSQL connection (optional) |
| provider | str | "github" | Provider: github, gitlab, jira |
| repo | str | None | Repository (GitHub/GitLab: owner/repo) |
| base_url | str | None | Base URL for self-hosted instances |
| project_key | str | None | Jira project key (e.g., PROJ) |
| token | str | None | API authentication token |
| max_issues | int | 500 | Maximum issues to fetch |

Sync parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| since_days | int | 90 | How many days back to look |
| link_via_commits | bool | True | Link issues to code through commit refs |

Supported Providers

| Provider | Config | Issue Pattern |
|---|---|---|
| GitHub | provider="github", repo="owner/repo" | #123, GH-123 |
| GitLab | provider="gitlab", repo="owner/repo" | #123, !123 (MRs) |
| Jira | provider="jira", project_key="PROJ" | PROJ-123 |

Note: For Jira, use the project_key parameter (not repo).
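
The issue patterns listed above can be recognized in commit messages with regular expressions. The sketch below is one possible approach, not the matching logic of IssueSyncService, which this guide does not show. The function name extract_issue_refs and the exact expressions are assumptions; references glued to a preceding word character (such as issue#12) are deliberately ignored.

```python
import re
from typing import List, Optional


def extract_issue_refs(message: str, provider: str,
                       project_key: Optional[str] = None) -> List[str]:
    """Return issue references found in a commit message, in order, without duplicates."""
    if provider == "github":
        pattern = re.compile(r"(?<!\w)#\d+\b|\bGH-\d+\b")
    elif provider == "gitlab":
        pattern = re.compile(r"(?<!\w)[#!]\d+\b")
    elif provider == "jira":
        if not project_key:
            raise ValueError("project_key is required for Jira")
        pattern = re.compile(rf"\b{re.escape(project_key)}-\d+\b")
    else:
        raise ValueError(f"unknown provider: {provider}")

    refs: List[str] = []
    for match in pattern.finditer(message):
        if match.group(0) not in refs:
            refs.append(match.group(0))
    return refs


print(extract_issue_refs("Fix login (#12), see GH-7 and #12", "github"))
# prints ['#12', 'GH-7']
print(extract_issue_refs("PROJ-123: fix NPE, relates to OTHER-1", "jira", project_key="PROJ"))
# prints ['PROJ-123']
```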

Available Issue Tags

| Tag Name | Description | Example Value |
|---|---|---|
| issue-id | Issue identifier | PROJ-123, #456 |
| issue-type | Type of issue | bug, feature, refactor |
| issue-status | Current status | open, closed, in_progress |
| issue-priority | Issue priority | critical, high, medium, low |
| issue-label | Issue labels | critical, tech-debt |

Creating Issues

The IssueSyncService can create Jira issues programmatically:

service = IssueSyncService(
    duckdb_conn=conn,
    provider="jira",
    project_key="PROJ",
    base_url="https://company.atlassian.net",
    token=os.getenv("JIRA_TOKEN")
)

issue_key = service.create_jira_issue(
    summary="Fix NullPointerException in AuthService",
    description="Method authenticate() throws NPE when token is expired",
    issue_type="Bug",        # default
    priority="Medium",       # default
    labels=["codegraph", "auto-detected"],
    components=["auth"]
)
# Returns "PROJ-456" or None on failure

APM/Sentry Integration

SentrySyncService

Syncs error data from Sentry to identify error-prone code.

import os

from src.services.external_context import SentrySyncService

service = SentrySyncService(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    org_slug="my-organization",
    project_slug="my-project",
    token=os.getenv("SENTRY_AUTH_TOKEN"),
    base_url="https://sentry.io",  # default; override for self-hosted
    max_issues=200                  # default
)

result = service.sync(since_days=30, min_events=1)

Constructor parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| duckdb_conn | connection | required | DuckDB connection |
| pg_conn | connection | None | PostgreSQL connection (optional) |
| org_slug | str | None | Sentry organization slug |
| project_slug | str | None | Sentry project slug |
| token | str | None | Sentry auth token |
| base_url | str | "https://sentry.io" | Base URL (override for self-hosted) |
| max_issues | int | 200 | Maximum issues to fetch |

Sync parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| since_days | int | 30 | How many days back to look |
| min_events | int | 1 | Minimum event count to include |

Available Sentry Tags

| Tag Name | Description | Example Value |
|---|---|---|
| sentry-issue | Sentry issue ID | SENTRY-12345 |
| error-frequency | Frequency bucket based on event count | critical, high, medium, low, rare |
| error-type | Exception type | NullPointerException |
| error-level | Severity level | error, fatal, warning |
| sentry-first-seen | When error was first observed | 2025-06-15T08:30:00Z |

The error-frequency tag uses buckets based on total event count:

| Bucket | Event Count |
|---|---|
| critical | >= 10000 |
| high | >= 1000 |
| medium | >= 100 |
| low | >= 10 |
| rare | < 10 |
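
The bucket thresholds translate directly into a small lookup. The function name frequency_bucket is hypothetical; the thresholds are the ones listed for the error-frequency tag.

```python
def frequency_bucket(event_count: int) -> str:
    """Map a total event count to its error-frequency bucket."""
    thresholds = [(10000, "critical"), (1000, "high"), (100, "medium"), (10, "low")]
    # Thresholds are checked from highest to lowest, so the first match wins.
    for minimum, bucket in thresholds:
        if event_count >= minimum:
            return bucket
    return "rare"


print(frequency_bucket(2500))  # prints high
```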

Sentry Queries

-- Find methods with critical error frequency
SELECT m.full_name, m.filename, t.value AS frequency
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'error-frequency' AND t.value = 'critical';

-- Find methods by error type
SELECT m.full_name, t.value AS error_type
FROM nodes_method m
JOIN edges_tagged_by e ON m.id = e.src
JOIN nodes_tag_v2 t ON e.dst = t.id
WHERE t.name = 'error-type' AND t.value LIKE '%NullPointer%';

-- Combine: methods with high+ frequency AND fatal level
SELECT DISTINCT m.full_name, m.filename
FROM nodes_method m
JOIN edges_tagged_by e1 ON m.id = e1.src
JOIN nodes_tag_v2 t_freq ON e1.dst = t_freq.id
JOIN edges_tagged_by e2 ON m.id = e2.src
JOIN nodes_tag_v2 t_level ON e2.dst = t_level.id
WHERE t_freq.name = 'error-frequency' AND t_freq.value IN ('critical', 'high')
  AND t_level.name = 'error-level' AND t_level.value = 'fatal';

Using the Orchestrator

Python API

The ExternalContextOrchestrator coordinates all sync services.

from src.services.external_context import ExternalContextOrchestrator

orchestrator = ExternalContextOrchestrator(
    duckdb_conn=conn,
    pg_conn=pg_conn,       # optional
    repo_path=".",
    vector_store=None      # optional, for semantic indexing
)

# Sync all sources at once
result = orchestrator.sync_all(
    git_config={"since_days": 30, "use_blame": True},
    issue_config={
        "provider": "github",
        "repo": "owner/repo",
        "token": token,
    },
    sentry_config={
        "org_slug": "org",
        "project_slug": "proj",
        "token": sentry_token,
    },
    parallel=True  # run syncs in parallel
)

print(f"Success: {result.success}")
print(f"Items synced: {result.total_items_synced}")
print(f"Tags created: {result.total_tags_created}")
print(f"Edges created: {result.total_edges_created}")
print(f"Duration: {result.duration_seconds:.1f}s")

# Access per-source results
for source, src_result in result.source_results.items():
    print(f"  {source}: {src_result.items_synced} items, {src_result.tags_created} tags")

# Sync individual sources
git_result = orchestrator.sync_git(since_days=30, use_blame=True)
issue_result = orchestrator.sync_issues(
    provider="github", repo="owner/repo", token=token
)
sentry_result = orchestrator.sync_sentry(
    org_slug="org", project_slug="proj", token=sentry_token
)

# Get sync statistics
stats = orchestrator.get_sync_stats()

CLI Interface

# Sync git history
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --repo-path /path/to/repo \
    --sync-git --git-since-days 30

# Sync git without blame analysis
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-git --no-blame

# Sync git with custom max commits
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-git --git-max-commits 2000

# Sync GitHub issues
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-issues --issue-provider github --issue-repo owner/repo \
    --issue-token "$GITHUB_TOKEN"

# Sync GitLab issues (self-hosted)
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-issues --issue-provider gitlab --issue-repo owner/repo \
    --issue-url https://gitlab.company.com \
    --issue-token "$GITLAB_TOKEN"

# Sync Jira issues
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-issues --issue-provider jira --issue-project PROJ \
    --issue-url https://company.atlassian.net \
    --issue-token "$JIRA_TOKEN"

# Sync Sentry errors
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-sentry --sentry-org my-org --sentry-project my-project \
    --sentry-token "$SENTRY_AUTH_TOKEN"

# Sync Sentry (self-hosted)
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-sentry --sentry-org my-org --sentry-project my-project \
    --sentry-token "$SENTRY_AUTH_TOKEN" --sentry-url https://sentry.company.com

# Sync all sources
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --repo-path /path/to/repo \
    --sync-all

# View sync statistics
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb --stats

# Verbose output
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb --sync-git --verbose

Full CLI flags reference:

| Flag | Description | Default |
|---|---|---|
| --duckdb | Path to DuckDB database | required |
| --pg-url | PostgreSQL connection URL | None |
| --repo-path | Path to git repository | . |
| --sync-all | Sync all sources | - |
| --sync-git | Sync git history | - |
| --sync-issues | Sync issue tracker | - |
| --sync-sentry | Sync Sentry errors | - |
| --git-since-days | Git lookback period in days | 90 |
| --git-max-commits | Maximum git commits to process | 1000 |
| --no-blame | Skip git blame analysis | - |
| --issue-provider | Issue provider (github, gitlab, jira) | - |
| --issue-repo | Repository for GitHub/GitLab | - |
| --issue-project | Project key for Jira | - |
| --issue-url | Base URL for self-hosted instances | - |
| --issue-token | Issue tracker API token | - |
| --sentry-org | Sentry organization slug | - |
| --sentry-project | Sentry project slug | - |
| --sentry-token | Sentry auth token | - |
| --sentry-url | Sentry base URL (self-hosted) | - |
| --stats | Show sync statistics | - |
| --verbose | Verbose output | - |

Convenience Functions

The src.services.external_context package exports three convenience functions for quick one-off syncs:

import os

from src.services.external_context import (
    sync_git_to_cpg,
    sync_issues_to_cpg,
    sync_sentry_to_cpg,
)

# Sync git metadata
result = sync_git_to_cpg(
    duckdb_conn=conn,
    pg_conn=pg_conn,       # optional
    repo_path=".",
    since_days=30,
    use_blame=True
)

# Sync issues
result = sync_issues_to_cpg(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")
)

# Sync Sentry errors
result = sync_sentry_to_cpg(
    duckdb_conn=conn,
    pg_conn=pg_conn,
    org_slug="my-org",
    project_slug="my-project",
    token=os.getenv("SENTRY_AUTH_TOKEN")
)

All three return a SyncResult dataclass.

Server-Side API Reference

Enums

ExternalSource – identifies the external system:

| Value | Description |
|---|---|
| GIT | Git repository |
| JIRA | Jira issue tracker |
| GITHUB | GitHub |
| GITLAB | GitLab |
| SENTRY | Sentry error tracking |
| SONARQUBE | SonarQube code quality |

ContextType – type of external context:

| Value | Description |
|---|---|
| COMMIT | Git commit |
| ISSUE | Issue/ticket |
| ERROR | Error/exception |
| METRIC | Performance metric |
| REVIEW | Code review |

Core Dataclasses

SyncResult – result of a sync operation:

| Field | Type | Default | Description |
|---|---|---|---|
| source | ExternalSource | required | Source system |
| context_type | ContextType | required | Type of context |
| success | bool | required | Whether sync succeeded |
| items_synced | int | 0 | Number of items synced |
| items_failed | int | 0 | Number of items that failed |
| tags_created | int | 0 | Number of tags created |
| edges_created | int | 0 | Number of edges created |
| duration_seconds | float | 0.0 | Duration of sync |
| errors | List[str] | field(default_factory=list) | Error messages |
| metadata | Dict[str, Any] | field(default_factory=dict) | Additional metadata |

ExternalTag – a tag to attach to a CPG node:

| Field | Type | Default | Description |
|---|---|---|---|
| name | str | required | Tag name (e.g., git-author) |
| value | str | required | Tag value |
| external_source | ExternalSource | required | Source system |
| external_id | str | required | ID in external system |
| external_url | Optional[str] | None | URL in external system |
| confidence | float | 1.0 | Confidence of the link |
| metadata | Dict[str, Any] | field(default_factory=dict) | Additional metadata |

MethodTagLink – links a method to a tag:

| Field | Type | Description |
|---|---|---|
| method_id | int | CPG method node ID |
| method_full_name | str | Fully qualified method name |
| filename | str | Source file path |
| line_start | int | Method start line |
| line_end | int | Method end line |
| tag | ExternalTag | The tag to attach |

Git Dataclasses

GitCommit:

| Field | Type | Default | Description |
|---|---|---|---|
| sha | str | required | Commit SHA |
| author_email | str | required | Author email |
| author_name | str | required | Author name |
| timestamp | datetime | required | Commit timestamp |
| message | str | required | Commit message |
| files | List[str] | required | Changed files |
| lines_added | int | 0 | Lines added |
| lines_deleted | int | 0 | Lines deleted |
| branch | Optional[str] | None | Branch name |
| issue_refs | List[str] | None | Referenced issue IDs |

GitBlameEntry:

| Field | Type | Description |
|---|---|---|
| commit_sha | str | Commit SHA |
| author_email | str | Author email |
| author_name | str | Author name |
| timestamp | datetime | Commit timestamp |
| line_number | int | Line number |
| line_content | str | Line content |

Issue Dataclasses

Issue:

| Field | Type | Default | Description |
|---|---|---|---|
| id | str | required | Issue ID |
| title | str | required | Issue title |
| description | Optional[str] | required | Issue description |
| issue_type | str | required | Type (bug, feature, etc.) |
| status | str | required | Current status |
| priority | Optional[str] | required | Priority level |
| assignee | Optional[str] | required | Assignee |
| reporter | Optional[str] | required | Reporter |
| labels | List[str] | required | Labels |
| created_at | Optional[datetime] | required | Creation timestamp |
| updated_at | Optional[datetime] | required | Last update timestamp |
| url | Optional[str] | required | Issue URL |
| linked_commits | List[str] | required | Linked commit SHAs |
| linked_files | List[str] | required | Linked file paths |

Sentry Dataclasses

SentryIssue:

| Field | Type | Default | Description |
|---|---|---|---|
| id | str | required | Sentry issue ID |
| short_id | str | required | Short ID (e.g., PROJ-ABC) |
| title | str | required | Issue title |
| culprit | str | required | Culprit (function/file) |
| level | str | required | Severity level |
| status | str | required | Issue status |
| count | int | required | Total event count |
| user_count | int | required | Affected user count |
| first_seen | datetime | required | First occurrence |
| last_seen | datetime | required | Last occurrence |
| url | Optional[str] | None | Sentry issue URL |
| stacktrace_frames | List[Dict] | required | Raw stacktrace frames |
| tags | Dict[str, str] | required | Sentry tags |

StackFrame:

| Field | Type | Default | Description |
|---|---|---|---|
| filename | str | required | Source file |
| function | str | required | Function name |
| lineno | int | required | Line number |
| context_line | Optional[str] | None | Source line content |
| in_app | bool | True | Whether frame is in-app code |
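
A StackFrame carries a filename and a lineno, while CPG methods are described by a filename and a line_start/line_end range, so linking a frame to a method plausibly comes down to a containment check. How SentrySyncService resolves frames is not shown in this guide; the sketch below is an assumption, including the dict shape of a method, the helper name method_for_frame, and the rule of preferring the innermost method.

```python
from typing import Dict, List, Optional


def method_for_frame(methods: List[Dict], filename: str, lineno: int) -> Optional[Dict]:
    """Return the innermost method whose line range contains the frame's line."""
    candidates = [
        m for m in methods
        if m["filename"] == filename and m["line_start"] <= lineno <= m["line_end"]
    ]
    # Prefer the shortest range so nested functions win over their enclosing method.
    return min(candidates, key=lambda m: m["line_end"] - m["line_start"], default=None)


methods = [
    {"full_name": "auth.login", "filename": "src/auth.py", "line_start": 10, "line_end": 60},
    {"full_name": "auth.login.check", "filename": "src/auth.py", "line_start": 20, "line_end": 30},
]
print(method_for_frame(methods, "src/auth.py", 25)["full_name"])
# prints auth.login.check
```

The filename comparison is exact, which is why path mismatches between the CPG and the external system lead to tags without edges.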

ExternalContextBase

Abstract base class for all sync services. Located in src/services/external_context/base.py.

class ExternalContextBase(ABC):
    def __init__(self, duckdb_conn, pg_conn=None, source: ExternalSource = None):
        ...

Concrete methods:

| Method | Parameters | Returns | Description |
|---|---|---|---|
| create_tag | tag: ExternalTag | int | Creates a tag in nodes_tag_v2, returns tag ID |
| create_tag_edge | method_id: int, tag_id: int | bool | Creates edge in edges_tagged_by |
| find_methods_by_file_lines | filename: str, line_start: int, line_end: int | List[Dict] | Finds methods overlapping the given line range |
| store_external_context | external_id, context_type, raw_data, linked_files=None, linked_cpg_nodes=None, external_url=None | bool | Stores raw context in PostgreSQL |
| get_existing_tags | tag_name: str | Dict[str, int] | Returns existing tags by name as {value: id} dict |

Abstract methods (must be implemented by subclasses):

| Method | Returns | Description |
|---|---|---|
| sync(**kwargs) | SyncResult | Perform the sync operation |
| get_supported_tag_categories() | List[str] | Return list of tag categories this service creates |
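
The shape of a custom sync service follows from the two abstract methods. The sketch below does not import the real base class; ExternalContextBaseSketch is a local stand-in with the same constructor signature, and SonarQubeSyncService with its sonar-issue category is purely hypothetical (this guide documents no such service).

```python
from abc import ABC, abstractmethod
from typing import List


class ExternalContextBaseSketch(ABC):
    """Stand-in for ExternalContextBase with only the two abstract methods."""

    def __init__(self, duckdb_conn, pg_conn=None, source=None):
        self.duckdb_conn = duckdb_conn
        self.pg_conn = pg_conn
        self.source = source

    @abstractmethod
    def sync(self, **kwargs):
        ...

    @abstractmethod
    def get_supported_tag_categories(self) -> List[str]:
        ...


class SonarQubeSyncService(ExternalContextBaseSketch):
    """Hypothetical service; its name and tag category are illustrative only."""

    def sync(self, **kwargs):
        # A real implementation would fetch data, call create_tag and
        # create_tag_edge, and return a SyncResult; this stub returns a dict.
        return {"success": True, "items_synced": 0}

    def get_supported_tag_categories(self) -> List[str]:
        return ["sonar-issue"]


service = SonarQubeSyncService(duckdb_conn=None)
print(service.get_supported_tag_categories())
```

A subclass that omits either abstract method cannot be instantiated, which is the usual reason to declare them with ABC.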

OrchestratorResult

Result of an orchestrated multi-source sync.

| Field | Type | Default | Description |
|---|---|---|---|
| success | bool | True | Overall success |
| total_items_synced | int | 0 | Total items across all sources |
| total_tags_created | int | 0 | Total tags created |
| total_edges_created | int | 0 | Total edges created |
| duration_seconds | float | 0.0 | Total duration |
| source_results | Dict[str, SyncResult] | field(default_factory=dict) | Per-source results |
| errors | List[str] | field(default_factory=list) | Collected errors |

Method: add_result(source: str, result: SyncResult) – adds a source result and updates totals.
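
The aggregation performed by add_result can be sketched with two small dataclasses. Both classes below are trimmed stand-ins named with a Sketch suffix, not the package's own definitions. How the real method combines success flags and errors is not stated in this guide, so treating one failed source as an overall failure, and prefixing errors with the source name, are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyncResultSketch:
    """Trimmed stand-in for SyncResult (source and context_type omitted)."""
    success: bool
    items_synced: int = 0
    tags_created: int = 0
    edges_created: int = 0
    errors: List[str] = field(default_factory=list)


@dataclass
class OrchestratorResultSketch:
    success: bool = True
    total_items_synced: int = 0
    total_tags_created: int = 0
    total_edges_created: int = 0
    source_results: Dict[str, SyncResultSketch] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)

    def add_result(self, source: str, result: SyncResultSketch) -> None:
        """Record one source's result and fold its counters into the totals."""
        self.source_results[source] = result
        self.total_items_synced += result.items_synced
        self.total_tags_created += result.tags_created
        self.total_edges_created += result.edges_created
        self.errors.extend(f"{source}: {message}" for message in result.errors)
        # Assumption: one failed source marks the whole run as failed.
        self.success = self.success and result.success


total = OrchestratorResultSketch()
total.add_result("git", SyncResultSketch(True, items_synced=120, tags_created=300, edges_created=280))
total.add_result("sentry", SyncResultSketch(False, items_synced=5, errors=["HTTP 401"]))
print(total.success, total.total_items_synced, total.errors)
# prints False 125 ['sentry: HTTP 401']
```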

Troubleshooting

No Tags Created

  1. Verify the CPG database has methods in nodes_method:

     SELECT COUNT(*) FROM nodes_method;

  2. Check that file paths in the CPG match the repository structure. Path mismatches (e.g., absolute vs. relative) will prevent method-to-commit linking.
  3. Run sync with use_blame=True (the default) for git – blame provides line-level precision.

Git Sync Issues

# Verify git history is accessible
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-git --git-since-days 7 --verbose

# Skip blame if it's too slow on large repos
python -m src.services.external_context.orchestrator \
    --duckdb /path/to/cpg.duckdb \
    --sync-git --no-blame

GitHub Rate Limiting

Use an authenticated token for higher rate limits:

service = IssueSyncService(
    duckdb_conn=conn,
    provider="github",
    repo="owner/repo",
    token=os.getenv("GITHUB_TOKEN")
)

Sentry API Errors

  1. Verify org_slug and project_slug are correct (check Sentry dashboard URL)
  2. Check the token has project:read and event:read scopes
  3. Try a smaller date range to confirm connectivity, for example service.sync(since_days=7) through the Python API (--git-since-days applies only to the git sync)
  4. For self-hosted Sentry, set --sentry-url to your instance URL

Tags Exist but No Edges

If tags appear in nodes_tag_v2 but there are no edges in edges_tagged_by, it typically means file paths in the CPG don’t match the paths returned by git/Sentry. Check:

-- Inspect tag values
SELECT name, value, COUNT(*) FROM nodes_tag_v2 GROUP BY name, value LIMIT 20;

-- Check edge count
SELECT COUNT(*) FROM edges_tagged_by;
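
When the mismatch is between absolute and repository-relative paths, normalizing both sides to the same form before comparing them is one way to diagnose it. The helper below is a hypothetical sketch built on the standard library; it is not part of the package and assumes the repository root is known.

```python
from pathlib import PurePosixPath


def to_repo_relative(path: str, repo_root: str) -> str:
    """Normalize a path to a POSIX-style path relative to repo_root when possible."""
    normalized = PurePosixPath(path.replace("\\", "/"))
    root = PurePosixPath(repo_root.replace("\\", "/"))
    try:
        return str(normalized.relative_to(root))
    except ValueError:
        # Not under repo_root (for example, already relative): return it cleaned up.
        return str(normalized)


print(to_repo_relative("/home/me/repo/src/auth.py", "/home/me/repo"))
# prints src/auth.py
print(to_repo_relative("./src/auth.py", "/home/me/repo"))
# prints src/auth.py
```

Applying the same normalization to filenames from nodes_method and to paths reported by git or Sentry shows quickly whether the two sets can match at all.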

Next Steps