Importing a New Codebase¶
Guide to importing new projects into the CodeGraph system.
Note: This guide covers creating new CPG data from source code. For using existing CPG data, simply configure
cpg.db_path in config.yaml to point to your DuckDB file.
Table of Contents¶
- Overview
- Supported Languages
- CLI Usage
- Full Pipeline (Single Command)
- Docker Support
- Project Management
- Step-by-Step Import
- List Supported Languages
- REST API Usage
- Get List of Supported Languages
- Start Import (Asynchronous)
- Check Import Status
- List All Import Jobs
- Cancel Import
- Run Individual Step
- Import with Docker
- Project Management
- WebSocket for Progress Tracking
- Import Parameters
- Import Modes
- Cloning Options
- GoCPG Options
- Documentation Options
- Import Result
- Result Structure (ProjectImportResult)
- CPG Validation
- Quality Score (0-100)
- Checked Metrics
- Source Code Import
- How It Works
- Supported File Extensions
- File Size Limit
- Path Normalization
- Import Statistics
- Domain Plugin
- Plugin Structure
- Configuration: subsystems.yaml
- Configuration: prompts.yaml
- Activating Domain Plugin
- Handling Large Repositories
- Large C/C++ Projects
- Recommendations
- Python API
- Running Individual Steps
- Troubleshooting
- GoCPG Frontend Not Found
- GoCPG Process Failure
- Language Not Detected
- CPG Validation Failed
- Configuration (config.yaml)
- Component Architecture
- ProjectRegistry
- LocalGoCPGRunner / DockerGoCPGRunner
- See Also
Overview¶
The system supports automatic import of codebases written in various programming languages. The process consists of the following steps:
- Clone - repository cloning
- Detect Language - programming language detection
- Create CPG - Code Property Graph creation (GoCPG outputs DuckDB directly)
- Import Source Code - full source file content import into DuckDB
- Validate - CPG integrity validation
- Import Docs - documentation indexing into ChromaDB
- Create Plugin - Domain Plugin generation
Supported Languages¶
| Language | File Extensions | Description |
|---|---|---|
| C/C++ | .c, .h, .cpp, .hpp, .cc, .cxx | C/C++ source code |
| C# | .cs | C# source code |
| Go | .go | Go source code |
| Java | .java | Java source code |
| JavaScript/TypeScript | .js, .jsx, .ts, .tsx, .mjs | JavaScript/TypeScript |
| Kotlin | .kt, .kts | Kotlin source code |
| PHP | .php | PHP source code |
| Python | .py, .pyw | Python source code |
| 1C:Enterprise | .bsl, .os | 1C:Enterprise (BSL/SDBL) |
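The Detect Language step can be pictured as counting files per extension and picking the most common language. The sketch below is only an illustration under that assumption, not the actual CodeGraph detector: `EXTENSION_TO_LANGUAGE` and `detect_language` are hypothetical names, the mapping covers only part of the table, and language ids other than `c` and `java` are made up for the example.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Hypothetical mapping built from part of the table above.
EXTENSION_TO_LANGUAGE = {
    ".c": "c", ".h": "c", ".cpp": "c", ".hpp": "c", ".cc": "c", ".cxx": "c",
    ".cs": "csharp",
    ".go": "go",
    ".java": "java",
    ".kt": "kotlin", ".kts": "kotlin",
    ".php": "php",
    ".py": "python", ".pyw": "python",
}


def detect_language(paths: Iterable[str]) -> Optional[str]:
    """Return the language with the most matching files, or None if none match."""
    counts = Counter()
    for path in paths:
        extension = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(extension)
        if language is not None:
            counts[language] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

A `None` result corresponds to the "No supported source files found" situation described under Troubleshooting, where passing `--language` explicitly is the suggested fix.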
CLI Usage¶
Full Pipeline (Single Command)¶
# Import from GitHub repository
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--branch master \
--shallow \
--language c
# Import local project
python -m src.cli.import_commands full \
--path /path/to/project \
--language java
# With selective import (only specific directories)
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--include src/backend src/include \
--exclude test tests
# Import using Docker
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker
Docker Support¶
The system supports running GoCPG in a Docker container for cross-platform operation:
# Import with Docker (no local GoCPG build required)
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker
# With specific Docker image
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker \
--docker-image codegraph/gocpg:v4.0.0
Docker Advantages:
- No local GoCPG build required
- Consistent behavior across all platforms (Windows, Linux, macOS)
- Isolated execution environment
- Automatic resource management
Project Management¶
# List all imported projects
python -m src.cli.import_commands projects list
# Project information
python -m src.cli.import_commands projects info my_project
# Activate project (set as current)
# When a project has a `domain` field, the corresponding domain plugin is activated automatically.
python -m src.cli.import_commands projects activate my_project
# Delete project (metadata only)
python -m src.cli.import_commands projects delete my_project
# Delete project with files (CPG, DuckDB)
python -m src.cli.import_commands projects delete my_project --delete-files
Projects are registered in config.yaml under projects.registry:
projects:
  active: postgres
  registry:
    postgres:
      db_path: data/projects/postgres.duckdb
      source_path: /path/to/source
      language: c
      domain: postgresql_v2  # Auto-activates domain plugin on switch
    my_python_app:
      db_path: data/projects/myapp.duckdb
      source_path: /path/to/myapp
      language: python
      domain: python_generic
The domain field is optional. When set, switching to a project automatically activates the corresponding domain plugin (e.g., postgresql_v2, python_generic). ChromaDB vector collections are also isolated per project.
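To make the optional `domain` lookup concrete, here is a minimal sketch that resolves the domain of the active project from a registry shaped like the configuration above. The dict mirrors the YAML by hand (no YAML parser is used), and `domain_for_active_project` is a hypothetical helper, not part of the CodeGraph API.

```python
from typing import Optional

# Hand-written mirror of the projects section shown above.
CONFIG = {
    "projects": {
        "active": "postgres",
        "registry": {
            "postgres": {
                "db_path": "data/projects/postgres.duckdb",
                "language": "c",
                "domain": "postgresql_v2",
            },
            "my_python_app": {
                "db_path": "data/projects/myapp.duckdb",
                "language": "python",
                "domain": "python_generic",
            },
        },
    }
}


def domain_for_active_project(config: dict) -> Optional[str]:
    """Return the domain of the active project, or None when no domain is set."""
    projects = config["projects"]
    entry = projects["registry"][projects["active"]]
    # The domain field is optional, so fall back to None instead of raising.
    return entry.get("domain")
```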
Step-by-Step Import¶
# 1. Clone repository
python -m src.cli.import_commands clone \
--repo https://github.com/org/repo \
--branch main \
--shallow \
--depth 1
# 2. Detect language
python -m src.cli.import_commands detect --path ./workspace/repo
# 3. Create CPG (outputs DuckDB directly)
python -m src.cli.import_commands cpg \
--path ./workspace/repo \
--language c
# 4. Validate
python -m src.cli.import_commands validate --db ./workspace/repo.duckdb
# 5. Import documentation
python -m src.cli.import_commands docs \
--path ./workspace/repo \
--db ./workspace/repo.duckdb
# 6. Create Domain Plugin
python -m src.cli.import_commands domain \
--path ./workspace/repo \
--name my_project \
--db ./workspace/repo.duckdb
List Supported Languages¶
python -m src.cli.import_commands languages
REST API Usage¶
Get List of Supported Languages¶
GET /api/v1/import/languages
Response:
{
"languages": [
{
"id": "c",
"name": "C",
"extensions": [".c", ".h", ".cpp", ".hpp"],
"gocpg_frontend": "c",
"gocpg_lang": "C"
},
{
"id": "java",
"name": "JAVA",
"extensions": [".java"],
"gocpg_frontend": "java",
"gocpg_lang": "JAVA"
}
]
}
Start Import (Asynchronous)¶
POST /api/v1/import/start
Content-Type: application/json
{
"repo_url": "https://github.com/postgres/postgres",
"branch": "master",
"shallow_clone": true,
"language": null,
"mode": "full",
"include_paths": ["src/backend", "src/include"],
"exclude_paths": ["test", "tests"],
"create_domain_plugin": true,
"import_docs": true
}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Import started. Use job_id to track progress."
}
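For callers working from Python, the request above could be assembled with the standard library as sketched here. The base URL `http://localhost:8000` is an assumption (the host and port are borrowed from the WebSocket example in this guide), and `build_import_request` is a hypothetical helper. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_import_request(base_url: str, repo_url: str, branch: str = "main") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/import/start."""
    payload = {
        "repo_url": repo_url,
        "branch": branch,
        "shallow_clone": True,
        "mode": "full",
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/import/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed base URL; adjust to wherever the API is actually served.
req = build_import_request(
    "http://localhost:8000",
    "https://github.com/postgres/postgres",
    branch="master",
)
```

Sending is left to the caller, for example with `urllib.request.urlopen(req)`, after which the returned `job_id` can be used with the status endpoint.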
Check Import Status¶
GET /api/v1/import/status/{job_id}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"project_name": "postgres",
"status": "in_progress",
"steps": [
{"name": "Clone Repository", "status": "completed", "progress": 100},
{"name": "Detect Language", "status": "completed", "progress": 100},
{"name": "Create CPG", "status": "in_progress", "progress": 45, "message": "Creating CPG nodes..."},
{"name": "Import Source Code", "status": "pending", "progress": 0},
{"name": "Validate CPG", "status": "pending", "progress": 0},
{"name": "Import Documentation", "status": "pending", "progress": 0},
{"name": "Setup Domain Plugin", "status": "pending", "progress": 0}
],
"current_step": "gocpg_parse",
"overall_progress": 35,
"created_at": "2024-12-09T10:00:00Z",
"updated_at": "2024-12-09T10:05:00Z"
}
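The source does not state how `overall_progress` is derived. The sample values above happen to be consistent with an unweighted mean of the per-step progress, so the sketch below uses that formula purely as an assumption; `overall_progress` here is a hypothetical function, not the server's implementation.

```python
from typing import Dict, List


def overall_progress(steps: List[Dict]) -> int:
    """Unweighted mean of per-step progress, rounded to an integer percentage."""
    if not steps:
        return 0
    return round(sum(step["progress"] for step in steps) / len(steps))


# Per-step progress values copied from the sample response above.
steps = [
    {"name": "Clone Repository", "progress": 100},
    {"name": "Detect Language", "progress": 100},
    {"name": "Create CPG", "progress": 45},
    {"name": "Import Source Code", "progress": 0},
    {"name": "Validate CPG", "progress": 0},
    {"name": "Import Documentation", "progress": 0},
    {"name": "Setup Domain Plugin", "progress": 0},
]
print(overall_progress(steps))
```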
List All Import Jobs¶
GET /api/v1/import/jobs?status_filter=in_progress&limit=10
Cancel Import¶
DELETE /api/v1/import/cancel/{job_id}
Run Individual Step¶
POST /api/v1/import/step
Content-Type: application/json
{
"step_id": "validate",
"context": {
"duckdb_path": "./workspace/project.duckdb"
}
}
Import with Docker¶
POST /api/v1/import/start
Content-Type: application/json
{
"repo_url": "https://github.com/example/project",
"branch": "main",
"use_docker": true,
"docker_image": "codegraph/gocpg:latest"
}
Project Management¶
List projects:
GET /api/v1/import/projects
Response:
{
"projects": [
{
"id": "123",
"name": "my_project",
"language": "python",
"cpg_path": "./workspace/my_project.cpg",
"duckdb_path": "./workspace/my_project.duckdb",
"is_active": true,
"created_at": "2024-12-10T10:00:00Z"
}
]
}
Activate project:
POST /api/v1/import/projects/{project_id}/activate
Delete project:
DELETE /api/v1/import/projects/{project_id}?delete_files=true
WebSocket for Progress Tracking¶
const ws = new WebSocket('ws://localhost:8000/api/v1/ws/jobs/550e8400-e29b-41d4-a716-446655440000');
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
switch (msg.type) {
case 'job.progress':
console.log(`Progress: ${msg.payload.progress}% - ${msg.payload.message}`);
break;
case 'job.completed':
console.log('Import completed:', msg.payload.result);
break;
case 'job.failed':
console.error('Import failed:', msg.payload.error);
break;
}
};
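The same message handling can be expressed in Python for clients that do not run in a browser. This sketch covers only the dispatch on `msg.type` and assumes the payload fields used in the JavaScript example (`progress`, `message`, `result`, `error`). Opening the WebSocket connection itself is left out, and `describe_message` is a hypothetical helper.

```python
import json


def describe_message(raw: str) -> str:
    """Turn one raw WebSocket message into a human-readable line."""
    msg = json.loads(raw)
    kind = msg.get("type")
    payload = msg.get("payload", {})
    if kind == "job.progress":
        return f"Progress: {payload['progress']}% - {payload['message']}"
    if kind == "job.completed":
        return f"Import completed: {payload['result']}"
    if kind == "job.failed":
        return f"Import failed: {payload['error']}"
    # Unknown message types are reported rather than raising.
    return f"Ignored message type: {kind}"
```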
Import Parameters¶
Import Modes¶
| Mode | Description |
|---|---|
| full | Full import of entire codebase |
| selective | Import only specified paths (include_paths) |
| incremental | Import only changes since last import |
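As an illustration of selective mode, the sketch below decides whether a repository-relative path would be imported given `include_paths` and `exclude_paths`. The matching rules are assumptions, since the source does not define them: excludes match any single path component, and includes match by directory prefix. `is_selected` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Sequence


def is_selected(path: str, include_paths: Sequence[str], exclude_paths: Sequence[str]) -> bool:
    """Return True if a repository-relative path passes the include/exclude filters."""
    parts = PurePosixPath(path).parts
    # Assumed rule: a path is excluded if any component equals an exclude entry.
    if any(part in exclude_paths for part in parts):
        return False
    # With no include list, everything not excluded is kept (full mode).
    if not include_paths:
        return True
    # Assumed rule: includes are directory prefixes.
    return any(
        path == include or path.startswith(include.rstrip("/") + "/")
        for include in include_paths
    )
```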
Cloning Options¶
| Parameter | Default | Description |
|---|---|---|
| shallow_clone | true | Use shallow clone |
| shallow_depth | 1 | Shallow clone depth |
| branch | "main" | Branch to clone |
GoCPG Options¶
| Parameter | Default | Description |
|---|---|---|
| gocpg_memory_gb | 16 | Memory for GoCPG (GB) |
| batch_size | 10000 | Batch size for DuckDB export |
| use_docker | false | Use Docker for GoCPG |
| docker_image | codegraph/gocpg:latest | GoCPG Docker image |
Documentation Options¶
| Parameter | Default | Description |
|---|---|---|
| import_docs | true | Import documentation |
| import_readme | true | Index README files |
| import_comments | true | Import code comments |
Import Result¶
After successful import, the following are created:
workspace/
├── postgres/                 # Source code
└── postgres.duckdb           # DuckDB database (CPG graph written by GoCPG)
chromadb_storage/
└── postgres_documentation/   # ChromaDB collection
src/domains/
└── postgres/                 # Domain Plugin
    ├── __init__.py
    ├── plugin.py
    ├── subsystems.yaml
    └── prompts.yaml
Result Structure (ProjectImportResult)¶
{
"cpg_path": "./workspace/postgres.cpg",
"duckdb_path": "./workspace/postgres.duckdb",
"domain_plugin_path": "./src/domains/postgres",
"chromadb_collection": "postgres_documentation",
"chromadb_stats": {
"readme_indexed": 45,
"docs_indexed": 230,
"comments_indexed": 1500
},
"cpg_stats": {
"methods": 125000,
"calls": 450000,
"identifiers": 890000
},
"source_code_stats": {
"files_imported": 6307,
"files_skipped_size": 12,
"total_size_mb": 84.95
},
"validation_report": {
"status": "passed",
"quality_score": 85
},
"detected_language": "c",
"import_duration_seconds": 3600.5
}
CPG Validation¶
Quality Score (0-100)¶
Quality assessment of the imported CPG:
| Criterion | Points |
|---|---|
| Methods found | +50 |
| Files linked to methods (>50%) | +20 |
| AST edges present | +8 |
| CFG edges present | +7 |
| No validation errors | +15 |
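The point table above adds up to 100 and can be written as a small scoring function. This is a sketch of the scoring as tabulated, not the validator's actual code. The parameter names are hypothetical, and the ">50%" criterion is read as a strict comparison.

```python
def quality_score(
    methods_found: bool,
    file_link_ratio: float,
    has_ast_edges: bool,
    has_cfg_edges: bool,
    validation_errors: int,
) -> int:
    """Sum the points from the criteria table (maximum 100)."""
    score = 0
    if methods_found:
        score += 50
    if file_link_ratio > 0.5:  # share of methods linked to files
        score += 20
    if has_ast_edges:
        score += 8
    if has_cfg_edges:
        score += 7
    if validation_errors == 0:
        score += 15
    return score
```

For example, a CPG that meets every structural criterion but still reports validation errors would score 85 under this reading.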
Checked Metrics¶
- methods_exist - number of methods
- calls_exist - number of calls
- edges_ast - AST edges
- edges_cfg - CFG edges
- methods_with_files - methods linked to files
Source Code Import¶
The SourceContentStep imports full source code content into nodes_file.content for code navigation and analysis.
How It Works¶
- Reads files from source_path specified in project configuration
- Populates nodes_file.content with full file contents
- Automatically normalizes file paths for JOIN compatibility with nodes_method
- Detects programming language from file extension
Supported File Extensions¶
| Language | Extensions |
|---|---|
| C/C++ | .c, .h, .cpp, .hpp, .cc, .cxx |
| Python | .py, .pyw |
| Java | .java |
| JavaScript/TypeScript | .js, .jsx, .ts, .tsx |
| Go | .go |
| Rust | .rs |
| PHP | .php |
| C# | .cs |
| Kotlin | .kt, .kts |
| 1C:Enterprise | .bsl, .os |
| Scala | .scala |
| SQL | .sql |
| Shell | .sh, .bash |
| Config | .yaml, .yml, .json, .xml, .toml, .ini |
File Size Limit¶
Files larger than 500 KB are skipped to keep database size manageable. This limit covers most source files while excluding large generated or binary files.
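A minimal sketch of the size filter follows. It assumes "500 KB" means 500 * 1024 bytes (the source does not say whether KB is 1000 or 1024 bytes) and takes a mapping of path to size so it stays independent of the filesystem. `split_by_size` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Assumption: "500 KB" is read as 500 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 500 * 1024


def split_by_size(
    sizes: Dict[str, int], limit: int = MAX_FILE_SIZE_BYTES
) -> Tuple[List[str], List[str]]:
    """Split paths into (imported, skipped) based on size in bytes."""
    imported: List[str] = []
    skipped: List[str] = []
    for path, size in sizes.items():
        # Only files strictly larger than the limit are skipped.
        if size > limit:
            skipped.append(path)
        else:
            imported.append(path)
    return imported, skipped
```

In practice the sizes could be gathered with `Path.stat().st_size` while walking the source tree.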
Path Normalization¶
File paths in nodes_file.name are automatically normalized to match nodes_method.filename format. Common prefixes like src/ are stripped to enable direct JOINs:
-- Get method source code by line number
SELECT
m.full_name,
m.line_number,
m.line_number_end,
f.content
FROM nodes_method m
JOIN nodes_file f ON REPLACE(m.filename, '/', '\') = REPLACE(f.name, '/', '\')
WHERE m.full_name = 'exec_simple_query';
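The normalization described above might look roughly like the following in Python. This is a guess at the behavior, not the actual SourceContentStep code: the prefix list is an assumption (only `src/` is named in the text), and this sketch unifies separators to forward slashes, whereas the SQL above unifies them to backslashes. Either direction works as long as both sides of the JOIN use the same one.

```python
# Assumption: only "src/" is stripped; the real list of prefixes is not given.
STRIPPED_PREFIXES = ("src/",)


def normalize_path(path: str) -> str:
    """Normalize separators and strip a leading './' and known prefixes."""
    normalized = path.replace("\\", "/")
    while normalized.startswith("./"):
        normalized = normalized[2:]
    for prefix in STRIPPED_PREFIXES:
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):]
            break
    return normalized
```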
Import Statistics¶
After import, the following statistics are available:
| Metric | Description |
|---|---|
| source_files_imported | Number of files successfully imported |
| source_files_skipped_size | Files skipped due to size limit |
| source_files_skipped_not_found | Files not found in source path |
| source_files_total | Total files processed |
Domain Plugin¶
A plugin is automatically generated for working with the new project.
Plugin Structure¶
# src/domains/my_project/plugin.py
class MyProjectPlugin(DomainPlugin):
    @property
    def name(self) -> str:
        return "my_project"

    @property
    def display_name(self) -> str:
        return "My Project"

    def _load_subsystems(self) -> Dict[str, SubsystemInfo]:
        # Load from subsystems.yaml
        ...

    def get_vulnerability_function_mappings(self) -> Dict[str, List[str]]:
        return {
            "buffer_overflow": ["strcpy", "memcpy", ...],
            "sql_injection": [...],
            ...
        }
Configuration: subsystems.yaml¶
subsystems:
  core:
    description: "Core application logic"
    key_functions:
      - main
      - init
      - start
    patterns:
      - "src"
      - "lib"
    related_files: []
  utils:
    description: "Utility functions"
    key_functions: []
    patterns:
      - "util"
      - "helper"
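How `patterns` are matched against file paths is not specified in this guide. The sketch below assumes a case-insensitive substring match and returns the first subsystem whose pattern occurs in the path. `SUBSYSTEMS` mirrors the YAML above by hand, and `subsystem_for` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Hand-written mirror of the patterns in subsystems.yaml above.
SUBSYSTEMS: Dict[str, Dict[str, List[str]]] = {
    "core": {"patterns": ["src", "lib"]},
    "utils": {"patterns": ["util", "helper"]},
}


def subsystem_for(path: str, subsystems: Dict[str, Dict[str, List[str]]] = SUBSYSTEMS) -> Optional[str]:
    """Return the first subsystem with a pattern contained in the path, else None."""
    lowered = path.lower()
    for name, info in subsystems.items():
        # Assumed rule: a pattern matches if it appears anywhere in the path.
        if any(pattern in lowered for pattern in info.get("patterns", [])):
            return name
    return None
```

Under this rule the declaration order matters: a path such as `src/util/x.c` would be assigned to `core`, because that subsystem is listed first.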
Configuration: prompts.yaml¶
prompts:
  onboarding:
    system: |
      You are a My Project expert helping developers understand the codebase.
    user_template: |
      Help me understand the following aspect: {query}
  security:
    system: |
      You are a security expert analyzing My Project (C) code.
    user_template: |
      Analyze the following code for security vulnerabilities:
      {code}
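The `{query}` and `{code}` placeholders suggest the templates are filled by simple string substitution. The sketch below assumes Python's `str.format` is an acceptable stand-in for whatever the plugin really uses. `PROMPTS` mirrors the YAML by hand, and `render_user_prompt` is a hypothetical helper.

```python
from typing import Dict

# Hand-written mirror of the user templates in prompts.yaml above.
PROMPTS: Dict[str, Dict[str, str]] = {
    "onboarding": {
        "user_template": "Help me understand the following aspect: {query}",
    },
    "security": {
        "user_template": "Analyze the following code for security vulnerabilities:\n{code}",
    },
}


def render_user_prompt(prompts: Dict[str, Dict[str, str]], kind: str, **values: str) -> str:
    """Fill the user template of the given prompt kind with the supplied values."""
    # Only the template is parsed for placeholders; braces inside the
    # substituted values (e.g. C code) are inserted verbatim.
    return prompts[kind]["user_template"].format(**values)
```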
Activating Domain Plugin¶
After creating the plugin, add it to the configuration:
# config.yaml
domains:
  active: "my_project"
  available:
    - postgresql_v2
    - my_project
Or programmatically:
from src.domains import DomainRegistry
DomainRegistry.activate("my_project")
Handling Large Repositories¶
Large C/C++ Projects¶
Note: Large C/C++ projects use the generic_cpp domain plugin for analysis.
# Use shallow clone
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--shallow \
--depth 1
# Or selective import
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--include src/backend/executor \
--mode selective
# Increase memory for GoCPG
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--memory 32
Recommendations¶
- Use shallow clone to save space and time
- Select needed directories via --include
- Exclude tests via --exclude test tests
- Increase GoCPG memory for large projects (16-32GB)
Python API¶
from src.project_import import (
ProjectImportPipeline,
ProjectImportRequest,
SupportedLanguage,
ImportMode,
)
# Create request
request = ProjectImportRequest(
repo_url="https://github.com/example/project",
branch="main",
shallow_clone=True,
language=SupportedLanguage.JAVA, # or None for auto-detection
mode=ImportMode.FULL,
include_paths=["src/main"],
exclude_paths=["src/test"],
create_domain_plugin=True,
import_docs=True,
)
# Run pipeline
async def run_import():
    def progress_callback(status):
        print(f"Progress: {status.overall_progress}% - {status.current_step}")

    pipeline = ProjectImportPipeline(progress_callback=progress_callback)
    result = await pipeline.run(request)
    print(f"CPG: {result.cpg_path}")
    print(f"DuckDB: {result.duckdb_path}")
    print(f"Language: {result.detected_language}")
    print(f"Quality Score: {result.validation_report['quality_score']}")

import asyncio
asyncio.run(run_import())
Running Individual Steps¶
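import asyncio
from pathlib import Path
from src.project_import import ProjectImportRequest
from src.project_import.pipeline import ProjectImportPipeline

async def run_validation():
    pipeline = ProjectImportPipeline()
    # Step context
    context = {
        "request": ProjectImportRequest(),
        "source_path": Path("./workspace/project"),
        "duckdb_path": "./workspace/project.duckdb",
    }
    # Run validation step (await is only valid inside a coroutine)
    result = await pipeline.run_step("validate", context)
    print(result["validation_report"])

asyncio.run(run_validation())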
Troubleshooting¶
GoCPG Frontend Not Found¶
RuntimeError: Frontend not found at expected paths
Solution: Check GOCPG_HOME or specify explicitly:
export GOCPG_PATH=/path/to/gocpg
python -m src.cli.import_commands full --repo ...
GoCPG Process Failure¶
Error: GoCPG binary exits with non-zero code
Solution:
- Check available disk space for DuckDB output
- Verify source path is accessible: ls <source_path>
- Run with verbose logging: gocpg parse --input=<path> --output=<db> --lang=c -v
- Increase memory allocation if processing a large codebase: --memory 32
Language Not Detected¶
ValueError: No supported source files found
Solution: Specify language explicitly:
python -m src.cli.import_commands full --repo ... --language java
CPG Validation Failed¶
Validation errors: ['methods_exist: expected >= 1, got 0']
Solution: Check:
1. Source code path is correct
2. GoCPG frontend matches the language
3. Files are not excluded by patterns
Configuration (config.yaml)¶
Settings for the project_import module in config.yaml:
project_import:
  gocpg:
    # Path to local GoCPG binary (optional if using Docker)
    home: ${GOCPG_HOME}
    # Use Docker instead of local GoCPG
    use_docker: false
    # Docker image for GoCPG
    docker_image: "codegraph/gocpg:latest"
    # Memory limit (GB)
    memory_gb: 16
  workspace:
    # Directory for cloned repositories
    clone_dir: "./workspace"
    # Directory for CPG files
    cpg_dir: "./workspace"
    # Directory for DuckDB files
    duckdb_dir: "./workspace"
  defaults:
    # Default shallow clone depth
    shallow_depth: 1
    # Default exclusion patterns
    exclude_patterns:
      - "node_modules"
      - "venv"
      - ".venv"
      - "__pycache__"
      - ".git"
      - "test"
      - "tests"
      - "vendor"
      - "third_party"
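The `${GOCPG_HOME}` value implies that environment variables are expanded when the configuration is loaded. How CodeGraph does this is not shown here, so the sketch below only illustrates one plausible approach using the standard library's `os.path.expandvars`, applied recursively to an already-parsed config. `expand_env` is a hypothetical helper.

```python
import os
from typing import Any


def expand_env(value: Any) -> Any:
    """Recursively expand ${VAR} references in strings of a parsed config."""
    if isinstance(value, str):
        # Unset variables are left unchanged by os.path.expandvars.
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value
```

Because unset variables are left as literal `${...}` text, a value that still contains `${GOCPG_HOME}` after expansion is a hint that the variable was never exported.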
Component Architecture¶
ProjectRegistry¶
Project registry in PostgreSQL:
from src.project_import import ProjectRegistry
async with ProjectRegistry() as registry:
    # List projects
    projects = await registry.list_projects()
    # Activate project
    await registry.set_active_project("my_project")
    # Delete project
    await registry.delete_project("old_project", delete_files=True)
GoCPGClient (Unified Wrapper)¶
As an alternative to direct subprocess calls via runners, the GoCPGClient provides a unified async Python wrapper for all GoCPG commands with Pydantic result models:
from src.services.gocpg import GoCPGClient
client = GoCPGClient() # auto-detects binary path from config.yaml
result = await client.parse(input_path="/src", output_path="data/projects/postgres.duckdb", language="c")
result = await client.update(input_path="/src", output_path="data/projects/postgres.duckdb", force=True)
ci_result = await client.ci_update(input_path="/src", output_path="data/projects/postgres.duckdb", base_ref="origin/main")
stats = await client.stats()
See src/services/gocpg/ for the full client API (17 async methods, 12 Pydantic models).
LocalGoCPGRunner / DockerGoCPGRunner¶
Runners for executing GoCPG commands:
# Local execution
from src.project_import import LocalGoCPGRunner
runner = LocalGoCPGRunner(gocpg_path="/path/to/gocpg")
# Docker execution
from src.project_import import DockerGoCPGRunner
runner = DockerGoCPGRunner(image="codegraph/gocpg:latest")
# Run parse
await runner.run_parse(source_path, output_db, language="python")
See Also¶
- REST API Documentation - HTTP API endpoints
- API Reference - Python API
- Scenarios Guide - Analysis scenarios