Importing a New Codebase

Guide to importing new projects into the CodeGraph system.

Note: This guide covers creating new CPG data from source code. To use existing CPG data, set cpg.db_path in config.yaml to point to your DuckDB file.

Overview

The system supports automatic import of codebases written in a range of programming languages. The import pipeline runs these steps in order:

  1. Clone - repository cloning
  2. Detect Language - programming language detection
  3. Create CPG - Code Property Graph creation (GoCPG outputs DuckDB directly)
  4. Import Source Code - full source file content import into DuckDB
  5. Validate - CPG integrity validation
  6. Import Docs - documentation indexing into ChromaDB
  7. Create Plugin - Domain Plugin generation

Supported Languages

Language               File Extensions                    Description
C/C++                  .c, .h, .cpp, .hpp, .cc, .cxx      C/C++ source code
C#                     .cs                                C# source code
Go                     .go                                Go source code
Java                   .java                              Java source code
JavaScript/TypeScript  .js, .jsx, .ts, .tsx, .mjs         JavaScript/TypeScript
Kotlin                 .kt, .kts                          Kotlin source code
PHP                    .php                               PHP source code
Python                 .py, .pyw                          Python source code
1C:Enterprise          .bsl, .os                          1C:Enterprise (BSL/SDBL)
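
The Detect Language step is not specified in detail here. As a minimal sketch, assuming detection simply picks the language whose extensions occur most often, the table above could drive it as follows. The function name detect_language and the language ids other than "c" and "java" are hypothetical, not part of the CodeGraph API.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Extension table copied from the list above. Only the ids "c" and "java"
# appear elsewhere in this guide; the other ids are illustrative.
EXTENSIONS = {
    "c": {".c", ".h", ".cpp", ".hpp", ".cc", ".cxx"},
    "csharp": {".cs"},
    "go": {".go"},
    "java": {".java"},
    "javascript": {".js", ".jsx", ".ts", ".tsx", ".mjs"},
    "kotlin": {".kt", ".kts"},
    "php": {".php"},
    "python": {".py", ".pyw"},
    "1c": {".bsl", ".os"},
}


def detect_language(paths: Iterable[str]) -> Optional[str]:
    """Return the language with the most matching files, or None if none match."""
    counts: Counter = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        for lang, exts in EXTENSIONS.items():
            if suffix in exts:
                counts[lang] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Returning None for an empty match mirrors the "No supported source files found" case described under Troubleshooting; the real step may use additional signals such as build files.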

CLI Usage

Full Pipeline (Single Command)

# Import from GitHub repository
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --branch master \
    --shallow \
    --language c

# Import local project
python -m src.cli.import_commands full \
    --path /path/to/project \
    --language java

# With selective import (only specific directories)
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --include src/backend src/include \
    --exclude test tests

# Import using Docker
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker

Docker Support

The system supports running GoCPG in a Docker container for cross-platform operation:

# Import with Docker (no local GoCPG build required)
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker

# With specific Docker image
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker \
    --docker-image codegraph/gocpg:v4.0.0

Docker Advantages:

  • No local GoCPG build required
  • Consistent behavior across all platforms (Windows, Linux, macOS)
  • Isolated execution environment
  • Automatic resource management

Project Management

# List all imported projects
python -m src.cli.import_commands projects list

# Project information
python -m src.cli.import_commands projects info my_project

# Activate project (set as current)
# When a project has a `domain` field, the corresponding domain plugin is activated automatically.
python -m src.cli.import_commands projects activate my_project

# Delete project (metadata only)
python -m src.cli.import_commands projects delete my_project

# Delete project with files (CPG, DuckDB)
python -m src.cli.import_commands projects delete my_project --delete-files

Projects are registered in config.yaml under projects.registry:

projects:
  active: postgres
  registry:
    postgres:
      db_path: data/projects/postgres.duckdb
      source_path: /path/to/source
      language: c
      domain: postgresql_v2    # Auto-activates domain plugin on switch
    my_python_app:
      db_path: data/projects/myapp.duckdb
      source_path: /path/to/myapp
      language: python
      domain: python_generic

The domain field is optional. When set, switching to a project automatically activates the corresponding domain plugin (e.g., postgresql_v2, python_generic). ChromaDB vector collections are also isolated per project.
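
As an illustration of how the registry layout can be read, the sketch below resolves the active project from an already-parsed config.yaml (represented here as a plain dict, so no YAML library is required). The helper resolve_active_project is hypothetical and not part of the CodeGraph API.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_active_project(config: Dict[str, Any]) -> Tuple[str, str, Optional[str]]:
    """Return (name, db_path, domain) for the active project.

    domain is None when the optional field is not set.
    """
    projects = config["projects"]
    name = projects["active"]
    entry = projects["registry"][name]
    return name, entry["db_path"], entry.get("domain")


# Same structure as the YAML example above, already parsed into a dict.
config = {
    "projects": {
        "active": "postgres",
        "registry": {
            "postgres": {
                "db_path": "data/projects/postgres.duckdb",
                "source_path": "/path/to/source",
                "language": "c",
                "domain": "postgresql_v2",
            },
            "my_python_app": {
                "db_path": "data/projects/myapp.duckdb",
                "source_path": "/path/to/myapp",
                "language": "python",
            },
        },
    }
}

name, db_path, domain = resolve_active_project(config)
print(name, db_path, domain)
```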

Step-by-Step Import

# 1. Clone repository
python -m src.cli.import_commands clone \
    --repo https://github.com/org/repo \
    --branch main \
    --shallow \
    --depth 1

# 2. Detect language
python -m src.cli.import_commands detect --path ./workspace/repo

# 3. Create CPG (outputs DuckDB directly)
python -m src.cli.import_commands cpg \
    --path ./workspace/repo \
    --language c

# 4. Validate
python -m src.cli.import_commands validate --db ./workspace/repo.duckdb

# 5. Import documentation
python -m src.cli.import_commands docs \
    --path ./workspace/repo \
    --db ./workspace/repo.duckdb

# 6. Create Domain Plugin
python -m src.cli.import_commands domain \
    --path ./workspace/repo \
    --name my_project \
    --db ./workspace/repo.duckdb

List Supported Languages

python -m src.cli.import_commands languages

REST API Usage

Get List of Supported Languages

GET /api/v1/import/languages

Response:

{
  "languages": [
    {
      "id": "c",
      "name": "C",
      "extensions": [".c", ".h", ".cpp", ".hpp"],
      "gocpg_frontend": "c",
      "gocpg_lang": "C"
    },
    {
      "id": "java",
      "name": "JAVA",
      "extensions": [".java"],
      "gocpg_frontend": "java",
      "gocpg_lang": "JAVA"
    }
  ]
}

Start Import (Asynchronous)

POST /api/v1/import/start
Content-Type: application/json

{
  "repo_url": "https://github.com/postgres/postgres",
  "branch": "master",
  "shallow_clone": true,
  "language": null,
  "mode": "full",
  "include_paths": ["src/backend", "src/include"],
  "exclude_paths": ["test", "tests"],
  "create_domain_plugin": true,
  "import_docs": true
}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Import started. Use job_id to track progress."
}
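
A minimal Python sketch of building this request with the standard library follows. The base URL http://localhost:8000 is an assumption taken from the WebSocket example in this guide, and build_start_request is a hypothetical helper. The request object is only constructed here, not sent.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_start_request(
    repo_url: str,
    branch: str = "main",
    mode: str = "full",
    include_paths: Optional[List[str]] = None,
    exclude_paths: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/import/start."""
    body = {
        "repo_url": repo_url,
        "branch": branch,
        "shallow_clone": True,
        "language": None,  # null in the JSON example above
        "mode": mode,
        "include_paths": include_paths or [],
        "exclude_paths": exclude_paths or [],
        "create_domain_plugin": True,
        "import_docs": True,
    }
    return urllib.request.Request(
        BASE_URL + "/api/v1/import/start",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_start_request(
    "https://github.com/postgres/postgres",
    branch="master",
    include_paths=["src/backend", "src/include"],
    exclude_paths=["test", "tests"],
)
```

Passing req to urllib.request.urlopen would submit the job; the job_id in the JSON response can then be used with the status endpoint.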

Check Import Status

GET /api/v1/import/status/{job_id}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "project_name": "postgres",
  "status": "in_progress",
  "steps": [
    {"name": "Clone Repository", "status": "completed", "progress": 100},
    {"name": "Detect Language", "status": "completed", "progress": 100},
    {"name": "Create CPG", "status": "in_progress", "progress": 45, "message": "Creating CPG nodes..."},
    {"name": "Import Source Code", "status": "pending", "progress": 0},
    {"name": "Validate CPG", "status": "pending", "progress": 0},
    {"name": "Import Documentation", "status": "pending", "progress": 0},
    {"name": "Setup Domain Plugin", "status": "pending", "progress": 0}
  ],
  "current_step": "gocpg_parse",
  "overall_progress": 35,
  "created_at": "2024-12-09T10:00:00Z",
  "updated_at": "2024-12-09T10:05:00Z"
}
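
In the sample response, overall_progress (35) equals the mean of the seven per-step progress values. Whether the server always computes it that way is not stated in this guide, so the sketch below is one plausible reading rather than the actual implementation. Both helper names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def overall_progress(steps: List[Dict[str, Any]]) -> int:
    """Mean of per-step progress, rounded to the nearest integer."""
    if not steps:
        return 0
    return round(sum(step["progress"] for step in steps) / len(steps))


def first_unfinished_step(steps: List[Dict[str, Any]]) -> Optional[str]:
    """Name of the first step that is not completed, or None if all are done."""
    for step in steps:
        if step["status"] != "completed":
            return step["name"]
    return None


# Step list copied from the sample response above.
steps = [
    {"name": "Clone Repository", "status": "completed", "progress": 100},
    {"name": "Detect Language", "status": "completed", "progress": 100},
    {"name": "Create CPG", "status": "in_progress", "progress": 45},
    {"name": "Import Source Code", "status": "pending", "progress": 0},
    {"name": "Validate CPG", "status": "pending", "progress": 0},
    {"name": "Import Documentation", "status": "pending", "progress": 0},
    {"name": "Setup Domain Plugin", "status": "pending", "progress": 0},
]

print(overall_progress(steps), first_unfinished_step(steps))
```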

List All Import Jobs

GET /api/v1/import/jobs?status_filter=in_progress&limit=10

Cancel Import

DELETE /api/v1/import/cancel/{job_id}

Run Individual Step

POST /api/v1/import/step
Content-Type: application/json

{
  "step_id": "validate",
  "context": {
    "duckdb_path": "./workspace/project.duckdb"
  }
}

Import with Docker

POST /api/v1/import/start
Content-Type: application/json

{
  "repo_url": "https://github.com/example/project",
  "branch": "main",
  "use_docker": true,
  "docker_image": "codegraph/gocpg:latest"
}

Project Management

List projects:

GET /api/v1/import/projects

Response:

{
  "projects": [
    {
      "id": "123",
      "name": "my_project",
      "language": "python",
      "cpg_path": "./workspace/my_project.cpg",
      "duckdb_path": "./workspace/my_project.duckdb",
      "is_active": true,
      "created_at": "2024-12-10T10:00:00Z"
    }
  ]
}

Activate project:

POST /api/v1/import/projects/{project_id}/activate

Delete project:

DELETE /api/v1/import/projects/{project_id}?delete_files=true

WebSocket for Progress Tracking

const ws = new WebSocket('ws://localhost:8000/api/v1/ws/jobs/550e8400-e29b-41d4-a716-446655440000');

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case 'job.progress':
      console.log(`Progress: ${msg.payload.progress}% - ${msg.payload.message}`);
      break;
    case 'job.completed':
      console.log('Import completed:', msg.payload.result);
      break;
    case 'job.failed':
      console.error('Import failed:', msg.payload.error);
      break;
  }
};
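
For Python clients, the same message handling can be sketched as a pure function that is independent of any particular WebSocket library. The message types and payload fields are taken from the JavaScript example above; describe_job_message is a hypothetical helper.

```python
import json


def describe_job_message(raw: str) -> str:
    """Turn one raw WebSocket text frame into a human-readable line."""
    msg = json.loads(raw)
    payload = msg.get("payload", {})
    kind = msg.get("type")
    if kind == "job.progress":
        return f"Progress: {payload['progress']}% - {payload['message']}"
    if kind == "job.completed":
        return f"Import completed: {payload['result']}"
    if kind == "job.failed":
        return f"Import failed: {payload['error']}"
    # Unknown message types are reported rather than silently dropped.
    return f"Unhandled message type: {kind}"


frame = json.dumps(
    {"type": "job.progress", "payload": {"progress": 45, "message": "Creating CPG nodes..."}}
)
print(describe_job_message(frame))
```

Keeping the parsing separate from the transport makes it straightforward to call this function from whichever WebSocket client the application already uses.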

Import Parameters

Import Modes

Mode         Description
full         Full import of entire codebase
selective    Import only specified paths (include_paths)
incremental  Import only changes since last import

Cloning Options

Parameter      Default  Description
shallow_clone  true     Use shallow clone
shallow_depth  1        Shallow clone depth
branch         "main"   Branch to clone

GoCPG Options

Parameter        Default                 Description
gocpg_memory_gb  16                      Memory for GoCPG (GB)
batch_size       10000                   Batch size for DuckDB export
use_docker       false                   Use Docker for GoCPG
docker_image     codegraph/gocpg:latest  GoCPG Docker image

Documentation Options

Parameter        Default  Description
import_docs      true     Import documentation
import_readme    true     Index README files
import_comments  true     Import code comments

Import Result

After successful import, the following are created:

workspace/
├── postgres/               # Source code
└── postgres.duckdb         # CPG database (DuckDB file written directly by GoCPG)

chromadb_storage/
└── postgres_documentation/  # ChromaDB collection

src/domains/
└── postgres/               # Domain Plugin
    ├── __init__.py
    ├── plugin.py
    ├── subsystems.yaml
    └── prompts.yaml

Result Structure (ProjectImportResult)

{
  "cpg_path": "./workspace/postgres.cpg",
  "duckdb_path": "./workspace/postgres.duckdb",
  "domain_plugin_path": "./src/domains/postgres",
  "chromadb_collection": "postgres_documentation",
  "chromadb_stats": {
    "readme_indexed": 45,
    "docs_indexed": 230,
    "comments_indexed": 1500
  },
  "cpg_stats": {
    "methods": 125000,
    "calls": 450000,
    "identifiers": 890000
  },
  "source_code_stats": {
    "files_imported": 6307,
    "files_skipped_size": 12,
    "total_size_mb": 84.95
  },
  "validation_report": {
    "status": "passed",
    "quality_score": 85
  },
  "detected_language": "c",
  "import_duration_seconds": 3600.5
}

CPG Validation

Quality Score (0-100)

Quality assessment of the imported CPG:

Criterion                           Points
Methods found                       +50
Files linked to methods (>50%)      +20
AST edges present                   +8
CFG edges present                   +7
No validation errors                +15
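
The table can be read as an additive score. The sketch below is an assumption about how the points combine: it treats each criterion as all-or-nothing and interprets the ">50%" criterion as the share of methods that are linked to a file. The function quality_score and its parameters are hypothetical.

```python
def quality_score(
    methods: int,
    methods_with_files: int,
    ast_edges: int,
    cfg_edges: int,
    validation_errors: int,
) -> int:
    """Sum the points from the criteria table (maximum 100)."""
    score = 0
    if methods > 0:
        score += 50  # methods found
        # Assumption: ">50%" refers to the share of methods linked to a file.
        if methods_with_files / methods > 0.5:
            score += 20
    if ast_edges > 0:
        score += 8
    if cfg_edges > 0:
        score += 7
    if validation_errors == 0:
        score += 15
    return score


print(quality_score(methods=100, methods_with_files=40, ast_edges=5, cfg_edges=0, validation_errors=0))
```

Under these assumptions a CPG with methods, AST edges and no validation errors, but weak file linkage and no CFG edges, scores 50 + 8 + 15 = 73.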

Checked Metrics

  • methods_exist - number of methods
  • calls_exist - number of calls
  • edges_ast - AST edges
  • edges_cfg - CFG edges
  • methods_with_files - methods linked to files

Source Code Import

The SourceContentStep imports full source code content into nodes_file.content for code navigation and analysis.

How It Works

  1. Reads files from source_path specified in project configuration
  2. Populates nodes_file.content with full file contents
  3. Automatically normalizes file paths for JOIN compatibility with nodes_method
  4. Detects programming language from file extension

Supported File Extensions

Language               Extensions
C/C++                  .c, .h, .cpp, .hpp, .cc, .cxx
Python                 .py, .pyw
Java                   .java
JavaScript/TypeScript  .js, .jsx, .ts, .tsx
Go                     .go
Rust                   .rs
PHP                    .php
C#                     .cs
Kotlin                 .kt, .kts
1C:Enterprise          .bsl, .os
Scala                  .scala
SQL                    .sql
Shell                  .sh, .bash
Config                 .yaml, .yml, .json, .xml, .toml, .ini

File Size Limit

Files larger than 500 KB are skipped to keep database size manageable. This limit covers most source files while excluding large generated or binary files.
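
A minimal sketch of the skip rule, assuming "500 KB" means 500 * 1024 bytes and that files with extensions outside the table above are ignored. The constant names, the reduced extension set, and classify_file are illustrative only.

```python
from pathlib import PurePosixPath

# Assumption: "500 KB" is interpreted as 500 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 500 * 1024

# Subset of the extension table above, shortened for the example.
SOURCE_EXTENSIONS = {".c", ".h", ".py", ".java", ".go", ".rs", ".sql", ".yaml"}


def classify_file(name: str, size_bytes: int) -> str:
    """Return 'imported', 'skipped_size', or 'unsupported_extension'."""
    if PurePosixPath(name).suffix.lower() not in SOURCE_EXTENSIONS:
        return "unsupported_extension"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return "skipped_size"
    return "imported"


print(classify_file("src/backend/parser/gram.c", 2_000_000))
```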

Path Normalization

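File paths in nodes_file.name are automatically normalized to match the nodes_method.filename format. Common prefixes such as src/ are stripped so that the two tables can be joined directly; the query below additionally normalizes path separators on both sides of the JOIN:

-- Fetch a method's line range together with the full content of its file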
SELECT
    m.full_name,
    m.line_number,
    m.line_number_end,
    f.content
FROM nodes_method m
JOIN nodes_file f ON REPLACE(m.filename, '/', '\') = REPLACE(f.name, '/', '\')
WHERE m.full_name = 'exec_simple_query';
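
The query returns the whole file in f.content together with the method's line range. To cut out just the method body on the client side, a small helper like the following could be used. It assumes line_number and line_number_end are 1-based and inclusive, which this guide does not state explicitly; method_source is a hypothetical name.

```python
def method_source(content: str, line_number: int, line_number_end: int) -> str:
    """Return lines line_number..line_number_end (1-based, inclusive) of content."""
    lines = content.splitlines()
    return "\n".join(lines[line_number - 1 : line_number_end])


file_content = "#include <stdio.h>\n\nint main(void)\n{\n    return 0;\n}\n"
print(method_source(file_content, 3, 6))
```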

Import Statistics

After import, the following statistics are available:

Metric                          Description
source_files_imported           Number of files successfully imported
source_files_skipped_size       Files skipped due to size limit
source_files_skipped_not_found  Files not found in source path
source_files_total              Total files processed

Domain Plugin

A plugin is automatically generated for working with the new project.

Plugin Structure

# src/domains/my_project/plugin.py

class MyProjectPlugin(DomainPlugin):
    @property
    def name(self) -> str:
        return "my_project"

    @property
    def display_name(self) -> str:
        return "My Project"

    def _load_subsystems(self) -> Dict[str, SubsystemInfo]:
        # Load from subsystems.yaml
        ...

    def get_vulnerability_function_mappings(self) -> Dict[str, List[str]]:
        return {
            "buffer_overflow": ["strcpy", "memcpy", ...],
            "sql_injection": [...],
            ...
        }

Configuration: subsystems.yaml

subsystems:
  core:
    description: "Core application logic"
    key_functions:
      - main
      - init
      - start
    patterns:
      - "src"
      - "lib"
    related_files: []

  utils:
    description: "Utility functions"
    key_functions: []
    patterns:
      - "util"
      - "helper"

Configuration: prompts.yaml

prompts:
  onboarding:
    system: |
      You are a My Project expert helping developers understand the codebase.
    user_template: |
      Help me understand the following aspect: {query}

  security:
    system: |
      You are a security expert analyzing My Project (C) code.
    user_template: |
      Analyze the following code for security vulnerabilities:
      {code}
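
The {query} and {code} placeholders follow Python's str.format syntax. Assuming the templates are rendered that way (this guide does not confirm it), filling one in is a one-liner; render_prompt is a hypothetical helper.

```python
# user_template from the "security" prompt above.
USER_TEMPLATE = "Analyze the following code for security vulnerabilities:\n{code}\n"


def render_prompt(template: str, **values: str) -> str:
    """Substitute named placeholders in a prompt template."""
    return template.format(**values)


# Braces inside the substituted value are safe: only the template is parsed.
prompt = render_prompt(USER_TEMPLATE, code="int main() { return 0; }")
print(prompt)
```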

Activating Domain Plugin

After creating the plugin, add it to the configuration:

# config.yaml
domains:
  active: "my_project"
  available:
    - postgresql_v2
    - my_project

Or programmatically:

from src.domains import DomainRegistry

DomainRegistry.activate("my_project")

Handling Large Repositories

Large C/C++ Projects

Note: Large C/C++ projects use the generic_cpp domain plugin for analysis.

# Use shallow clone
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --shallow \
    --depth 1

# Or selective import
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --include src/backend/executor \
    --mode selective

# Increase memory for GoCPG
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --memory 32

Recommendations

  1. Use shallow clone to save space and time
  2. Select needed directories via --include
  3. Exclude tests via --exclude test tests
  4. Increase GoCPG memory for large projects (16-32GB)
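
The recommendations can be combined into a single invocation. The sketch below only assembles the argument list from flags shown in this guide; whether all of them may be used together in one command is an assumption, and build_import_command is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def build_import_command(
    repo: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    shallow: bool = True,
    depth: int = 1,
    memory_gb: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the `full` import command."""
    cmd = ["python", "-m", "src.cli.import_commands", "full", "--repo", repo]
    if shallow:
        cmd += ["--shallow", "--depth", str(depth)]
    if include:
        cmd += ["--include", *include]
    if exclude:
        cmd += ["--exclude", *exclude]
    if memory_gb is not None:
        cmd += ["--memory", str(memory_gb)]
    return cmd


cmd = build_import_command(
    "https://github.com/postgres/postgres",
    include=["src/backend", "src/include"],
    exclude=["test", "tests"],
    memory_gb=32,
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.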

Python API

from src.project_import import (
    ProjectImportPipeline,
    ProjectImportRequest,
    SupportedLanguage,
    ImportMode,
)

# Create request
request = ProjectImportRequest(
    repo_url="https://github.com/example/project",
    branch="main",
    shallow_clone=True,
    language=SupportedLanguage.JAVA,  # or None for auto-detection
    mode=ImportMode.FULL,
    include_paths=["src/main"],
    exclude_paths=["src/test"],
    create_domain_plugin=True,
    import_docs=True,
)

# Run pipeline
async def run_import():
    def progress_callback(status):
        print(f"Progress: {status.overall_progress}% - {status.current_step}")

    pipeline = ProjectImportPipeline(progress_callback=progress_callback)
    result = await pipeline.run(request)

    print(f"CPG: {result.cpg_path}")
    print(f"DuckDB: {result.duckdb_path}")
    print(f"Language: {result.detected_language}")
    print(f"Quality Score: {result.validation_report['quality_score']}")

import asyncio
asyncio.run(run_import())

Running Individual Steps

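import asyncio
from pathlib import Path

from src.project_import import ProjectImportRequest
from src.project_import.pipeline import ProjectImportPipeline


async def run_validation():
    pipeline = ProjectImportPipeline()

    # Step context: the inputs the step reads
    context = {
        "request": ProjectImportRequest(),
        "source_path": Path("./workspace/project"),
        "duckdb_path": "./workspace/project.duckdb",
    }

    # Run only the validation step
    result = await pipeline.run_step("validate", context)
    print(result["validation_report"])


asyncio.run(run_validation())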

Troubleshooting

GoCPG Frontend Not Found

RuntimeError: Frontend not found at expected paths

Solution: Check GOCPG_HOME or specify explicitly:

export GOCPG_PATH=/path/to/gocpg
python -m src.cli.import_commands full --repo ...

GoCPG Process Failure

Error: GoCPG binary exits with non-zero code

Solution:

  • Check available disk space for DuckDB output
  • Verify source path is accessible: ls <source_path>
  • Run with verbose logging: gocpg parse --input=<path> --output=<db> --lang=c -v
  • Increase memory allocation if processing a large codebase: --memory 32

Language Not Detected

ValueError: No supported source files found

Solution: Specify language explicitly:

python -m src.cli.import_commands full --repo ... --language java

CPG Validation Failed

Validation errors: ['methods_exist: expected >= 1, got 0']

Solution: Check that:

  1. The source code path is correct
  2. The GoCPG frontend matches the language
  3. Files are not excluded by patterns


Configuration (config.yaml)

Settings for the project_import module in config.yaml:

project_import:
  gocpg:
    # Path to local GoCPG binary (optional if using Docker)
    home: ${GOCPG_HOME}
    # Use Docker instead of local GoCPG
    use_docker: false
    # Docker image for GoCPG
    docker_image: "codegraph/gocpg:latest"
    # Memory limit (GB)
    memory_gb: 16

  workspace:
    # Directory for cloned repositories
    clone_dir: "./workspace"
    # Directory for CPG files
    cpg_dir: "./workspace"
    # Directory for DuckDB files
    duckdb_dir: "./workspace"

  defaults:
    # Default shallow clone depth
    shallow_depth: 1
    # Default exclusion patterns
    exclude_patterns:
      - "node_modules"
      - "venv"
      - ".venv"
      - "__pycache__"
      - ".git"
      - "test"
      - "tests"
      - "vendor"
      - "third_party"

Component Architecture

ProjectRegistry

Project registry in PostgreSQL:

import asyncio

from src.project_import import ProjectRegistry


async def main():
    async with ProjectRegistry() as registry:
        # List projects
        projects = await registry.list_projects()

        # Activate project
        await registry.set_active_project("my_project")

        # Delete project
        await registry.delete_project("old_project", delete_files=True)


asyncio.run(main())

GoCPGClient (Unified Wrapper)

As an alternative to direct subprocess calls via runners, the GoCPGClient provides a unified async Python wrapper for all GoCPG commands with Pydantic result models:

from src.services.gocpg import GoCPGClient

client = GoCPGClient()  # auto-detects binary path from config.yaml
result = await client.parse(input_path="/src", output_path="data/projects/postgres.duckdb", language="c")
result = await client.update(input_path="/src", output_path="data/projects/postgres.duckdb", force=True)
ci_result = await client.ci_update(input_path="/src", output_path="data/projects/postgres.duckdb", base_ref="origin/main")
stats = await client.stats()

See src/services/gocpg/ for the full client API (17 async methods, 12 Pydantic models).

LocalGoCPGRunner / DockerGoCPGRunner

Runners for executing GoCPG commands:

# Local execution
from src.project_import import LocalGoCPGRunner
runner = LocalGoCPGRunner(gocpg_path="/path/to/gocpg")

# Docker execution
from src.project_import import DockerGoCPGRunner
runner = DockerGoCPGRunner(image="codegraph/gocpg:latest")

# Run parse
await runner.run_parse(source_path, output_db, language="python")
