Importing a New Codebase¶

Guide to importing new projects into the CodeGraph system.

Note: This guide covers creating new CPG data from source code. For using existing CPG data, simply configure the cpg.db_path in config.yaml to point to your DuckDB file.

Table of Contents¶

Overview
Supported Languages
CLI Usage
Full Pipeline (Single Command)
Docker Support
Project Management
Step-by-Step Import
List Supported Languages
REST API Usage
Get List of Supported Languages
Start Import (Asynchronous)
Check Import Status
List All Import Jobs
Cancel Import
Run Individual Step
Import with Docker
Project Management
WebSocket for Progress Tracking
Import Parameters
Import Modes
Cloning Options
GoCPG Options
Documentation Options
Import Result
Result Structure (ProjectImportResult)
CPG Validation
Quality Score (0-100)
Checked Metrics
Source Code Import
How It Works
Supported File Extensions
File Size Limit
Path Normalization
Import Statistics
Domain Plugin
Plugin Structure
Configuration: subsystems.yaml
Configuration: prompts.yaml
Activating Domain Plugin
Handling Large Repositories
Large C/C++ Projects
Recommendations
Python API
Running Individual Steps
Troubleshooting
GoCPG Frontend Not Found
GoCPG Process Failure
Language Not Detected
CPG Validation Failed
Configuration (config.yaml)
Component Architecture
ProjectRegistry
LocalGoCPGRunner / DockerGoCPGRunner
See Also

Overview¶

The system supports automatic import of codebases with various programming languages. The process includes:

Clone - repository cloning
Detect Language - programming language detection
Create CPG - Code Property Graph creation (GoCPG outputs DuckDB directly)
Import Source Code - full source file content import into DuckDB
Validate - CPG integrity validation
Import Docs - documentation indexing into ChromaDB
Create Plugin - Domain Plugin generation

Supported Languages¶

Language	File Extensions	Description
C/C++	`.c`, `.h`, `.cpp`, `.hpp`, `.cc`, `.cxx`	C/C++ source code
C#	`.cs`	C# source code
Go	`.go`	Go source code
Java	`.java`	Java source code
JavaScript/TypeScript	`.js`, `.jsx`, `.ts`, `.tsx`, `.mjs`	JavaScript/TypeScript
Kotlin	`.kt`, `.kts`	Kotlin source code
PHP	`.php`	PHP source code
Python	`.py`, `.pyw`	Python source code
1C:Enterprise	`.bsl`, `.os`	1C:Enterprise (BSL/SDBL)

CLI Usage¶

Full Pipeline (Single Command)¶

# Import from GitHub repository
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --branch master \
    --shallow \
    --language c

# Import local project
python -m src.cli.import_commands full \
    --path /path/to/project \
    --language java

# With selective import (only specific directories)
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --include src/backend src/include \
    --exclude test tests

# Import using Docker
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker

Docker Support¶

The system supports running GoCPG in a Docker container for cross-platform operation:

# Import with Docker (no local GoCPG build required)
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker

# With specific Docker image
python -m src.cli.import_commands full \
    --repo https://github.com/example/project \
    --docker \
    --docker-image codegraph/gocpg:v4.0.0

Docker Advantages: - No local GoCPG build required - Consistent behavior across all platforms (Windows, Linux, macOS) - Isolated execution environment - Automatic resource management

Project Management¶

# List all imported projects
python -m src.cli.import_commands projects list

# Project information
python -m src.cli.import_commands projects info my_project

# Activate project (set as current)
# When a project has a `domain` field, the corresponding domain plugin is activated automatically.
python -m src.cli.import_commands projects activate my_project

# Delete project (metadata only)
python -m src.cli.import_commands projects delete my_project

# Delete project with files (CPG, DuckDB)
python -m src.cli.import_commands projects delete my_project --delete-files

Projects are registered in config.yaml under projects.registry:

projects:
  active: postgres
  registry:
    postgres:
      db_path: data/projects/postgres.duckdb
      source_path: /path/to/source
      language: c
      domain: postgresql_v2    # Auto-activates domain plugin on switch
    my_python_app:
      db_path: data/projects/myapp.duckdb
      source_path: /path/to/myapp
      language: python
      domain: python_generic

The domain field is optional. When set, switching to a project automatically activates the corresponding domain plugin (e.g., postgresql_v2, python_generic). ChromaDB vector collections are also isolated per project.

Step-by-Step Import¶

# 1. Clone repository
python -m src.cli.import_commands clone \
    --repo https://github.com/org/repo \
    --branch main \
    --shallow \
    --depth 1

# 2. Detect language
python -m src.cli.import_commands detect --path ./workspace/repo

# 3. Create CPG (outputs DuckDB directly)
python -m src.cli.import_commands cpg \
    --path ./workspace/repo \
    --language c

# 4. Validate
python -m src.cli.import_commands validate --db ./workspace/repo.duckdb

# 5. Import documentation
python -m src.cli.import_commands docs \
    --path ./workspace/repo \
    --db ./workspace/repo.duckdb

# 6. Create Domain Plugin
python -m src.cli.import_commands domain \
    --path ./workspace/repo \
    --name my_project \
    --db ./workspace/repo.duckdb

List Supported Languages¶

python -m src.cli.import_commands languages

REST API Usage¶

Get List of Supported Languages¶

GET /api/v1/import/languages

Response:

{
  "languages": [
    {
      "id": "c",
      "name": "C",
      "extensions": [".c", ".h", ".cpp", ".hpp"],
      "gocpg_frontend": "c",
      "gocpg_lang": "C"
    },
    {
      "id": "java",
      "name": "JAVA",
      "extensions": [".java"],
      "gocpg_frontend": "java",
      "gocpg_lang": "JAVA"
    }
  ]
}

Start Import (Asynchronous)¶

POST /api/v1/import/start
Content-Type: application/json

{
  "repo_url": "https://github.com/postgres/postgres",
  "branch": "master",
  "shallow_clone": true,
  "language": null,
  "mode": "full",
  "include_paths": ["src/backend", "src/include"],
  "exclude_paths": ["test", "tests"],
  "create_domain_plugin": true,
  "import_docs": true
}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Import started. Use job_id to track progress."
}

Check Import Status¶

GET /api/v1/import/status/{job_id}

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "project_name": "postgres",
  "status": "in_progress",
  "steps": [
    {"name": "Clone Repository", "status": "completed", "progress": 100},
    {"name": "Detect Language", "status": "completed", "progress": 100},
    {"name": "Create CPG", "status": "in_progress", "progress": 45, "message": "Creating CPG nodes..."},
    {"name": "Import Source Code", "status": "pending", "progress": 0},
    {"name": "Validate CPG", "status": "pending", "progress": 0},
    {"name": "Import Documentation", "status": "pending", "progress": 0},
    {"name": "Setup Domain Plugin", "status": "pending", "progress": 0}
  ],
  "current_step": "gocpg_parse",
  "overall_progress": 35,
  "created_at": "2024-12-09T10:00:00Z",
  "updated_at": "2024-12-09T10:05:00Z"
}

List All Import Jobs¶

GET /api/v1/import/jobs?status_filter=in_progress&limit=10

Cancel Import¶

DELETE /api/v1/import/cancel/{job_id}

Run Individual Step¶

POST /api/v1/import/step
Content-Type: application/json

{
  "step_id": "validate",
  "context": {
    "duckdb_path": "./workspace/project.duckdb"
  }
}

Import with Docker¶

POST /api/v1/import/start
Content-Type: application/json

{
  "repo_url": "https://github.com/example/project",
  "branch": "main",
  "use_docker": true,
  "docker_image": "codegraph/gocpg:latest"
}

Project Management¶

List projects:

GET /api/v1/import/projects

Response:

{
  "projects": [
    {
      "id": "123",
      "name": "my_project",
      "language": "python",
      "cpg_path": "./workspace/my_project.cpg",
      "duckdb_path": "./workspace/my_project.duckdb",
      "is_active": true,
      "created_at": "2024-12-10T10:00:00Z"
    }
  ]
}

Activate project:

POST /api/v1/import/projects/{project_id}/activate

Delete project:

DELETE /api/v1/import/projects/{project_id}?delete_files=true

WebSocket for Progress Tracking¶

const ws = new WebSocket('ws://localhost:8000/api/v1/ws/jobs/550e8400-e29b-41d4-a716-446655440000');

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);

  switch (msg.type) {
    case 'job.progress':
      console.log(`Progress: ${msg.payload.progress}% - ${msg.payload.message}`);
      break;
    case 'job.completed':
      console.log('Import completed:', msg.payload.result);
      break;
    case 'job.failed':
      console.error('Import failed:', msg.payload.error);
      break;
  }
};

Import Parameters¶

Import Modes¶

Mode	Description
`full`	Full import of entire codebase
`selective`	Import only specified paths (`include_paths`)
`incremental`	Import only changes since last import

Cloning Options¶

Parameter	Default	Description
`shallow_clone`	`true`	Use shallow clone
`shallow_depth`	`1`	Shallow clone depth
`branch`	`"main"`	Branch to clone

GoCPG Options¶

Parameter	Default	Description
`gocpg_memory_gb`	`16`	Memory for GoCPG (GB)
`batch_size`	`10000`	Batch size for DuckDB export
`use_docker`	`false`	Use Docker for GoCPG
`docker_image`	`codegraph/gocpg:latest`	GoCPG Docker image

Documentation Options¶

Parameter	Default	Description
`import_docs`	`true`	Import documentation
`import_readme`	`true`	Index README files
`import_comments`	`true`	Import code comments

Import Result¶

After successful import, the following are created:

workspace/
├── postgres/               # Source code
├── postgres.duckdb         # GoCPG CPG database
└── postgres.duckdb         # DuckDB database (graph)

chromadb_storage/
└── postgres_documentation/  # ChromaDB collection

src/domains/
└── postgres/               # Domain Plugin
    ├── __init__.py
    ├── plugin.py
    ├── subsystems.yaml
    └── prompts.yaml

Result Structure (ProjectImportResult)¶

{
  "cpg_path": "./workspace/postgres.cpg",
  "duckdb_path": "./workspace/postgres.duckdb",
  "domain_plugin_path": "./src/domains/postgres",
  "chromadb_collection": "postgres_documentation",
  "chromadb_stats": {
    "readme_indexed": 45,
    "docs_indexed": 230,
    "comments_indexed": 1500
  },
  "cpg_stats": {
    "methods": 125000,
    "calls": 450000,
    "identifiers": 890000
  },
  "source_code_stats": {
    "files_imported": 6307,
    "files_skipped_size": 12,
    "total_size_mb": 84.95
  },
  "validation_report": {
    "status": "passed",
    "quality_score": 85
  },
  "detected_language": "c",
  "import_duration_seconds": 3600.5
}

CPG Validation¶

Quality Score (0-100)¶

Quality assessment of the imported CPG:

Criterion	Points
Methods found	+50
Files linked to methods (>50%)	+20
AST edges present	+8
CFG edges present	+7
No validation errors	+15

Checked Metrics¶

methods_exist - number of methods
calls_exist - number of calls
edges_ast - AST edges
edges_cfg - CFG edges
methods_with_files - methods linked to files

Source Code Import¶

The SourceContentStep imports full source code content into nodes_file.content for code navigation and analysis.

How It Works¶

Reads files from source_path specified in project configuration
Populates nodes_file.content with full file contents
Automatically normalizes file paths for JOIN compatibility with nodes_method
Detects programming language from file extension

Supported File Extensions¶

Language	Extensions
C/C++	`.c`, `.h`, `.cpp`, `.hpp`, `.cc`, `.cxx`
Python	`.py`, `.pyw`
Java	`.java`
JavaScript/TypeScript	`.js`, `.jsx`, `.ts`, `.tsx`
Go	`.go`
Rust	`.rs`
PHP	`.php`
C#	`.cs`
Kotlin	`.kt`, `.kts`
1C:Enterprise	`.bsl`, `.os`
Scala	`.scala`
SQL	`.sql`
Shell	`.sh`, `.bash`
Config	`.yaml`, `.yml`, `.json`, `.xml`, `.toml`, `.ini`

File Size Limit¶

Files larger than 500 KB are skipped to keep database size manageable. This limit covers most source files while excluding large generated or binary files.

Path Normalization¶

File paths in nodes_file.name are automatically normalized to match nodes_method.filename format. Common prefixes like src/ are stripped to enable direct JOINs:

-- Get method source code by line number
SELECT
    m.full_name,
    m.line_number,
    m.line_number_end,
    f.content
FROM nodes_method m
JOIN nodes_file f ON REPLACE(m.filename, '/', '\') = REPLACE(f.name, '/', '\')
WHERE m.full_name = 'exec_simple_query';

Import Statistics¶

After import, the following statistics are available:

Metric	Description
`source_files_imported`	Number of files successfully imported
`source_files_skipped_size`	Files skipped due to size limit
`source_files_skipped_not_found`	Files not found in source path
`source_files_total`	Total files processed

Domain Plugin¶

A plugin is automatically generated for working with the new project.

Plugin Structure¶

# src/domains/my_project/plugin.py

class MyProjectPlugin(DomainPlugin):
    @property
    def name(self) -> str:
        return "my_project"

    @property
    def display_name(self) -> str:
        return "My Project"

    def _load_subsystems(self) -> Dict[str, SubsystemInfo]:
        # Load from subsystems.yaml
        ...

    def get_vulnerability_function_mappings(self) -> Dict[str, List[str]]:
        return {
            "buffer_overflow": ["strcpy", "memcpy", ...],
            "sql_injection": [...],
            ...
        }

Configuration: subsystems.yaml¶

subsystems:
  core:
    description: "Core application logic"
    key_functions:
      - main
      - init
      - start
    patterns:
      - "src"
      - "lib"
    related_files: []

  utils:
    description: "Utility functions"
    key_functions: []
    patterns:
      - "util"
      - "helper"

Configuration: prompts.yaml¶

prompts:
  onboarding:
    system: |
      You are a My Project expert helping developers understand the codebase.
    user_template: |
      Help me understand the following aspect: {query}

  security:
    system: |
      You are a security expert analyzing My Project (C) code.
    user_template: |
      Analyze the following code for security vulnerabilities:
      {code}

Activating Domain Plugin¶

After creating the plugin, add it to the configuration:

# config.yaml
domains:
  active: "my_project"
  available:
    - postgresql_v2
    - my_project

Or programmatically:

from src.domains import DomainRegistry

DomainRegistry.activate("my_project")

Handling Large Repositories¶

Large C/C++ Projects¶

Note: Large C/C++ projects use the generic_cpp domain plugin for analysis.

# Use shallow clone
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --shallow \
    --depth 1

# Or selective import
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --include src/backend/executor \
    --mode selective

# Increase memory for GoCPG
python -m src.cli.import_commands full \
    --repo https://github.com/postgres/postgres \
    --memory 32

Recommendations¶

Use shallow clone to save space and time
Select needed directories via --include
Exclude tests via --exclude test tests
Increase GoCPG memory for large projects (16-32GB)

Python API¶

from src.project_import import (
    ProjectImportPipeline,
    ProjectImportRequest,
    SupportedLanguage,
    ImportMode,
)

# Create request
request = ProjectImportRequest(
    repo_url="https://github.com/example/project",
    branch="main",
    shallow_clone=True,
    language=SupportedLanguage.JAVA,  # or None for auto-detection
    mode=ImportMode.FULL,
    include_paths=["src/main"],
    exclude_paths=["src/test"],
    create_domain_plugin=True,
    import_docs=True,
)

# Run pipeline
async def run_import():
    def progress_callback(status):
        print(f"Progress: {status.overall_progress}% - {status.current_step}")

    pipeline = ProjectImportPipeline(progress_callback=progress_callback)
    result = await pipeline.run(request)

    print(f"CPG: {result.cpg_path}")
    print(f"DuckDB: {result.duckdb_path}")
    print(f"Language: {result.detected_language}")
    print(f"Quality Score: {result.validation_report['quality_score']}")

import asyncio
asyncio.run(run_import())

Running Individual Steps¶

from src.project_import.pipeline import ProjectImportPipeline

pipeline = ProjectImportPipeline()

# Step context
context = {
    "request": ProjectImportRequest(),
    "source_path": Path("./workspace/project"),
    "duckdb_path": "./workspace/project.duckdb",
}

# Run validation step
result = await pipeline.run_step("validate", context)
print(result["validation_report"])

Troubleshooting¶

GoCPG Frontend Not Found¶

RuntimeError: Frontend not found at expected paths

Solution: Check GOCPG_HOME or specify explicitly:

export GOCPG_PATH=/path/to/gocpg
python -m src.cli.import_commands full --repo ...

GoCPG Process Failure¶

Error: GoCPG binary exits with non-zero code

Solution: - Check available disk space for DuckDB output - Verify source path is accessible: ls <source_path> - Run with verbose logging: gocpg parse --input=<path> --output=<db> --lang=c -v - Increase memory allocation if processing a large codebase: --memory 32

Language Not Detected¶

ValueError: No supported source files found

Solution: Specify language explicitly:

python -m src.cli.import_commands full --repo ... --language java

CPG Validation Failed¶

Validation errors: ['methods_exist: expected >= 1, got 0']

Solution: Check: 1. Source code path is correct 2. GoCPG frontend matches the language 3. Files are not excluded by patterns

Configuration (config.yaml)¶

Settings for the project_import module in config.yaml:

project_import:
  gocpg:
    # Path to local GoCPG binary (optional if using Docker)
    home: ${GOCPG_HOME}
    # Use Docker instead of local GoCPG
    use_docker: false
    # Docker image for GoCPG
    docker_image: "codegraph/gocpg:latest"
    # Memory limit (GB)
    memory_gb: 16

  workspace:
    # Directory for cloned repositories
    clone_dir: "./workspace"
    # Directory for CPG files
    cpg_dir: "./workspace"
    # Directory for DuckDB files
    duckdb_dir: "./workspace"

  defaults:
    # Default shallow clone depth
    shallow_depth: 1
    # Default exclusion patterns
    exclude_patterns:
      - "node_modules"
      - "venv"
      - ".venv"
      - "__pycache__"
      - ".git"
      - "test"
      - "tests"
      - "vendor"
      - "third_party"

Component Architecture¶

ProjectRegistry¶

Project registry in PostgreSQL:

from src.project_import import ProjectRegistry

async with ProjectRegistry() as registry:
    # List projects
    projects = await registry.list_projects()

    # Activate project
    await registry.set_active_project("my_project")

    # Delete project
    await registry.delete_project("old_project", delete_files=True)

GoCPGClient (Unified Wrapper)¶

As an alternative to direct subprocess calls via runners, the GoCPGClient provides a unified async Python wrapper for all GoCPG commands with Pydantic result models:

from src.services.gocpg import GoCPGClient

client = GoCPGClient()  # auto-detects binary path from config.yaml
result = await client.parse(input_path="/src", output_path="data/projects/postgres.duckdb", language="c")
result = await client.update(input_path="/src", output_path="data/projects/postgres.duckdb", force=True)
ci_result = await client.ci_update(input_path="/src", output_path="data/projects/postgres.duckdb", base_ref="origin/main")
stats = await client.stats()

See src/services/gocpg/ for the full client API (17 async methods, 12 Pydantic models).

LocalGoCPGRunner / DockerGoCPGRunner¶

Runners for executing GoCPG commands:

# Local execution
from src.project_import import LocalGoCPGRunner
runner = LocalGoCPGRunner(gocpg_path="/path/to/gocpg")

# Docker execution
from src.project_import import DockerGoCPGRunner
runner = DockerGoCPGRunner(image="codegraph/gocpg:latest")

# Run parse
await runner.run_parse(source_path, output_db, language="python")

Importing a New Codebase

Importing a New Codebase¶

Table of Contents¶

Overview¶

Supported Languages¶

CLI Usage¶

Full Pipeline (Single Command)¶

Docker Support¶

Project Management¶

Step-by-Step Import¶

List Supported Languages¶

REST API Usage¶

Get List of Supported Languages¶

Start Import (Asynchronous)¶

Check Import Status¶

List All Import Jobs¶

Cancel Import¶

Run Individual Step¶

Import with Docker¶

Project Management¶

WebSocket for Progress Tracking¶

Import Parameters¶

Import Modes¶

Cloning Options¶

GoCPG Options¶

Documentation Options¶

Import Result¶

Result Structure (ProjectImportResult)¶

CPG Validation¶

Quality Score (0-100)¶

Checked Metrics¶

Source Code Import¶

How It Works¶

Supported File Extensions¶

File Size Limit¶

Path Normalization¶

Import Statistics¶

Domain Plugin¶

Plugin Structure¶

Configuration: subsystems.yaml¶

Configuration: prompts.yaml¶

Activating Domain Plugin¶

Handling Large Repositories¶

Large C/C++ Projects¶

Recommendations¶

Python API¶

Running Individual Steps¶

Troubleshooting¶

GoCPG Frontend Not Found¶

GoCPG Process Failure¶

Language Not Detected¶

CPG Validation Failed¶

Configuration (config.yaml)¶

Component Architecture¶

ProjectRegistry¶

GoCPGClient (Unified Wrapper)¶

LocalGoCPGRunner / DockerGoCPGRunner¶

See Also¶