Guide to importing new projects into the CodeGraph system.
Note: This guide covers creating new CPG data from source code. To use existing CPG data, register the project under projects.registry.<name>.db_path in config.yaml.
Table of Contents¶
- Overview
- Supported Languages
- CLI Usage
- Full Pipeline (Single Command)
- Docker Support
- Project Management
- Import Jobs
- Import Comments
- Step-by-Step Import
- List Supported Languages
- REST API Usage
- Get List of Supported Languages
- Start Import (Asynchronous)
- Check Import Status
- List All Import Jobs
- Cancel Import
- Run Individual Step
- Import with Docker
- Project Management (REST)
- WebSocket for Progress Tracking
- Import Parameters
- Import Modes
- Cloning Options
- GoCPG Options
- Documentation Options
- Import Result
- Result Structure (ProjectImportResult)
- CPG Validation
- Quality Score (0-100)
- Checked Metrics
- Source Code Import
- How It Works
- Supported File Extensions
- File Size Limit
- Path Normalization
- Import Statistics
- Domain Plugin
- Plugin Structure
- Configuration: subsystems.yaml
- Configuration: prompts.yaml
- Activating Domain Plugin
- Handling Large Repositories
- Large C/C++ Projects
- Recommendations
- Python API
- Running Individual Steps
- Troubleshooting
- GoCPG Frontend Not Found
- GoCPG Process Failure
- Language Not Detected
- CPG Validation Failed
- Configuration (config.yaml)
- Component Architecture
- ProjectRegistry
- GoCPGClient
- See Also
Overview¶
The system automatically imports codebases written in any of the supported programming languages. The import pipeline has 8 steps:
- Clone — repository cloning
- Detect Language — programming language detection
- Create CPG — Code Property Graph creation (GoCPG outputs DuckDB directly)
- Validate — CPG integrity validation
- ChromaDB Sync — documentation indexing into ChromaDB
- Doc Generation — auto-generate documentation from CPG
- Vector Index — index vector collections for retrieval
- Domain Setup — Domain Plugin generation
Supported Languages¶
| Language | File Extensions | Description |
|---|---|---|
| C/C++ | .c, .h, .cpp, .hpp, .cc, .cxx | C/C++ source code |
| C# | .cs | C# source code |
| Go | .go | Go source code |
| Java | .java | Java source code |
| JavaScript | .js, .jsx, .mjs | JavaScript source code |
| TypeScript | .ts, .tsx | TypeScript (dedicated GoCPG frontend) |
| Kotlin | .kt, .kts | Kotlin source code |
| PHP | .php | PHP source code |
| Python | .py, .pyw | Python source code |
| 1C:Enterprise | .bsl, .os | 1C:Enterprise (BSL/SDBL), GoCPG frontend only |
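The Detect Language step can be pictured as counting files per extension and picking the most common language. The sketch below is only an illustration built from the table above: `detect_language` is a hypothetical helper, and apart from `c`, `java` and `python` (which appear elsewhere in this guide) the language ids are assumed, not confirmed CLI values.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extension table copied from the Supported Languages section.
# Ids other than "c", "java" and "python" are assumptions for illustration.
EXTENSIONS = {
    ".c": "c", ".h": "c", ".cpp": "c", ".hpp": "c", ".cc": "c", ".cxx": "c",
    ".cs": "csharp",
    ".go": "go",
    ".java": "java",
    ".js": "javascript", ".jsx": "javascript", ".mjs": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".kt": "kotlin", ".kts": "kotlin",
    ".php": "php",
    ".py": "python", ".pyw": "python",
    ".bsl": "1c", ".os": "1c",
}


def detect_language(paths):
    """Return the language with the most matching files, or None if none match."""
    counts = Counter()
    for path in paths:
        lang = EXTENSIONS.get(PurePosixPath(path).suffix.lower())
        if lang is not None:
            counts[lang] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

When no file matches, the real pipeline reports an error (see Troubleshooting); the sketch returns None instead to stay side-effect free.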
CLI Usage¶
Full Pipeline (Single Command)¶
# Import from GitHub repository
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--branch master \
--shallow \
--language c
# Import local project
python -m src.cli.import_commands full \
--path /path/to/project \
--language java
# With selective import (only specific directories)
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--include src/backend src/include \
--exclude test tests
# Import using Docker
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker
Docker Support¶
The system supports running GoCPG in a Docker container for cross-platform operation:
# Import with Docker (no local GoCPG build required)
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker
# With specific Docker image
python -m src.cli.import_commands full \
--repo https://github.com/example/project \
--docker \
--docker-image codegraph/gocpg:v4.0.0
Docker Advantages:
- No local GoCPG build required
- Consistent behavior across all platforms (Windows, Linux, macOS)
- Isolated execution environment
- Automatic resource management
Project Management¶
# List all imported projects
python -m src.cli.import_commands projects list
# Project information
python -m src.cli.import_commands projects info my_project
# Activate project (set as current)
# When a project has a `domain` field, the corresponding domain plugin is activated automatically.
python -m src.cli.import_commands projects activate my_project
# Rename project
python -m src.cli.import_commands projects rename my_project new_name
python -m src.cli.import_commands projects rename my_project new_name --group my_group
# Delete project (metadata only)
python -m src.cli.import_commands projects delete my_project
# Delete project with files (CPG, DuckDB)
python -m src.cli.import_commands projects delete my_project --delete-files
# Delete project with ChromaDB collections
python -m src.cli.import_commands projects delete my_project --delete-collections
# Delete without confirmation prompt
python -m src.cli.import_commands projects delete my_project --delete-files --delete-collections -y
All project commands support --group to specify the project group.
Projects are registered in config.yaml under projects.registry:
projects:
  active: postgres
  registry:
    postgres:
      db_path: data/projects/postgres.duckdb
      source_path: /path/to/source
      language: c
      domain: postgresql_v2 # Auto-activates domain plugin on switch
    my_python_app:
      db_path: data/projects/myapp.duckdb
      source_path: /path/to/myapp
      language: python
      domain: python_generic
The domain field is optional. When set, switching to a project automatically activates the corresponding domain plugin (e.g., postgresql_v2, python_generic). ChromaDB vector collections are also isolated per project.
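To make the registry layout concrete, here is a minimal sketch of looking up the active project in an already-parsed configuration. The `resolve_active_project` helper is hypothetical and not part of CodeGraph; the dictionary mirrors the YAML structure above, with the `domain` key left off one entry to show that it is optional.

```python
# Parsed equivalent of the projects section of config.yaml (illustrative values).
config = {
    "projects": {
        "active": "postgres",
        "registry": {
            "postgres": {
                "db_path": "data/projects/postgres.duckdb",
                "language": "c",
                "domain": "postgresql_v2",
            },
            "my_python_app": {
                "db_path": "data/projects/myapp.duckdb",
                "language": "python",
                # no "domain": the field is optional
            },
        },
    }
}


def resolve_active_project(config):
    """Return (name, db_path, domain) for the active project; domain may be None."""
    projects = config["projects"]
    name = projects["active"]
    entry = projects["registry"][name]
    return name, entry["db_path"], entry.get("domain")
```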
Import Jobs¶
# List recent import jobs
python -m src.cli.import_commands jobs
# With filters
python -m src.cli.import_commands jobs --limit 20 --status completed
python -m src.cli.import_commands jobs --status failed
Import Comments¶
# Import source code comments into existing DuckDB
python -m src.cli.import_commands import-comments --db data/projects/postgres.duckdb
# With explicit source path
python -m src.cli.import_commands import-comments --db data/projects/postgres.duckdb --source /path/to/source
Step-by-Step Import¶
# 1. Clone repository
python -m src.cli.import_commands clone \
--repo https://github.com/org/repo \
--branch main \
--shallow \
--depth 1
# 2. Detect language
python -m src.cli.import_commands detect --path ./workspace/repo
# 3. Create CPG (outputs DuckDB directly)
python -m src.cli.import_commands cpg \
--path ./workspace/repo \
--language c
# 4. Validate
python -m src.cli.import_commands validate --db ./workspace/repo.duckdb
# 5. Import documentation
python -m src.cli.import_commands docs \
--path ./workspace/repo \
--db ./workspace/repo.duckdb
# 6. Create Domain Plugin
python -m src.cli.import_commands domain \
--path ./workspace/repo \
--name my_project \
--db ./workspace/repo.duckdb
List Supported Languages¶
python -m src.cli.import_commands languages
REST API Usage¶
Get List of Supported Languages¶
GET /api/v1/import/languages
Response:
{
"languages": [
{
"id": "c",
"name": "C",
"extensions": [".c", ".h", ".cpp", ".hpp"],
"gocpg_frontend": "c",
"gocpg_lang": "C"
},
{
"id": "java",
"name": "JAVA",
"extensions": [".java"],
"gocpg_frontend": "java",
"gocpg_lang": "JAVA"
}
]
}
Start Import (Asynchronous)¶
POST /api/v1/import/start
Content-Type: application/json
{
"repo_url": "https://github.com/postgres/postgres",
"branch": "master",
"shallow_clone": true,
"language": null,
"mode": "full",
"include_paths": ["src/backend", "src/include"],
"exclude_paths": ["test", "tests"],
"create_domain_plugin": true,
"import_docs": true
}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"message": "Import started. Use job_id to track progress."
}
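From Python, the same request could be prepared with the standard library alone. This is a sketch under assumptions: the base URL `http://localhost:8000` is borrowed from the WebSocket example in this guide, and `build_start_import_request` is a hypothetical helper. It only builds the request object; sending it is left to the caller.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: same host as the WebSocket example


def build_start_import_request(repo_url, branch="master", **options):
    """Build (but do not send) a POST request for /api/v1/import/start."""
    payload = {"repo_url": repo_url, "branch": branch, **options}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/import/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_start_import_request(
    "https://github.com/postgres/postgres",
    shallow_clone=True,
    mode="full",
    include_paths=["src/backend", "src/include"],
)
# To send it: urllib.request.urlopen(request), then read job_id from the JSON body.
```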
Check Import Status¶
GET /api/v1/import/status/{job_id}
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"project_name": "postgres",
"status": "in_progress",
"steps": [
{"name": "Clone Repository", "status": "completed", "progress": 100},
{"name": "Detect Language", "status": "completed", "progress": 100},
{"name": "Generate CPG", "status": "in_progress", "progress": 45, "message": "Creating CPG nodes..."},
{"name": "Validate CPG", "status": "pending", "progress": 0},
{"name": "Sync Documentation", "status": "pending", "progress": 0},
{"name": "Generate Documentation", "status": "pending", "progress": 0},
{"name": "Index Vector Collections", "status": "pending", "progress": 0},
{"name": "Setup Domain Plugin", "status": "pending", "progress": 0}
],
"current_step": "gocpg_parse",
"overall_progress": 35,
"created_at": "2024-12-09T10:00:00Z",
"updated_at": "2024-12-09T10:05:00Z"
}
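A client typically polls this endpoint until the job reaches a final state. The sketch below shows one way to do that with an injected `fetch_status` callable, so it makes no assumption about the HTTP library. `wait_for_job` is a hypothetical helper, and the set of terminal states is an assumption based on the `completed` and `failed` values used elsewhere in this guide; the API may define others.

```python
import time

# Assumption: these are the final states; the API may define more (e.g. cancelled).
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(fetch_status, job_id, interval=2.0, max_polls=1800):
    """Poll fetch_status(job_id) until the job status is terminal, then return it."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status["status"] in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Here `fetch_status` would wrap a GET to /api/v1/import/status/{job_id} and return the decoded JSON body.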
List All Import Jobs¶
GET /api/v1/import/jobs?status_filter=in_progress&limit=10
Cancel Import¶
DELETE /api/v1/import/cancel/{job_id}
Run Individual Step¶
POST /api/v1/import/step
Content-Type: application/json
{
"step_id": "validate",
"context": {
"duckdb_path": "./workspace/project.duckdb"
}
}
Import with Docker¶
POST /api/v1/import/start
Content-Type: application/json
{
"repo_url": "https://github.com/example/project",
"branch": "main",
"use_docker": true,
"docker_image": "codegraph/gocpg:latest"
}
Project Management (REST)¶
List projects:
GET /api/v1/projects
Response:
{
"projects": [
{
"id": "123",
"name": "my_project",
"language": "python",
"cpg_path": "./workspace/my_project.cpg",
"duckdb_path": "./workspace/my_project.duckdb",
"is_active": true,
"created_at": "2024-12-10T10:00:00Z"
}
]
}
Create project:
POST /api/v1/projects
Content-Type: application/json
{
"group_id": "group-uuid",
"name": "my_project",
"db_path": "./data/projects/my_project.duckdb",
"language": "python",
"domain": "python_generic"
}
Activate project:
POST /api/v1/projects/{project_id}/activate
Delete project:
DELETE /api/v1/projects/{project_id}
Delete collections:
DELETE /api/v1/projects/{project_id}/collections
WebSocket for Progress Tracking¶
const ws = new WebSocket('ws://localhost:8000/api/v1/ws/jobs/550e8400-e29b-41d4-a716-446655440000');
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
switch (msg.type) {
case 'job.progress':
console.log(`Progress: ${msg.payload.progress}% - ${msg.payload.message}`);
break;
case 'job.completed':
console.log('Import completed:', msg.payload.result);
break;
case 'job.failed':
console.error('Import failed:', msg.payload.error);
break;
}
};
Import Parameters¶
Import Modes¶
| Mode | Description |
|---|---|
| full | Full import of entire codebase |
| selective | Import only specified paths (include_paths) |
| incremental | Import only changes since last import |
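As an illustration of how selective mode might combine include_paths and exclude_paths, the sketch below treats include entries as directory prefixes and exclude entries as names that may appear anywhere in the path. Both rules are assumptions; the guide does not specify the exact matching semantics, and `is_selected` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def is_selected(path, include_paths=(), exclude_paths=()):
    """Decide whether a repository-relative path would be imported."""
    parts = PurePosixPath(path).parts
    # Assumption: an exclude entry matches any directory or file name in the path.
    if any(part in exclude_paths for part in parts):
        return False
    # With no include list, everything not excluded is kept (full mode).
    if not include_paths:
        return True
    # Assumption: an include entry matches as a leading directory prefix.
    for include in include_paths:
        prefix = PurePosixPath(include).parts
        if parts[:len(prefix)] == prefix:
            return True
    return False
```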
Cloning Options¶
| Parameter | Default | Description |
|---|---|---|
| shallow_clone | true | Use shallow clone |
| shallow_depth | 1 | Shallow clone depth |
| branch | "main" | Branch to clone |
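These options map naturally onto standard `git clone` flags. The sketch below shows one plausible translation; how the pipeline actually invokes git is not documented here, so treat `build_clone_command` as a hypothetical helper.

```python
def build_clone_command(repo_url, dest, branch="main", shallow_clone=True, shallow_depth=1):
    """Translate the cloning options into a git command line (argument list)."""
    cmd = ["git", "clone", "--branch", branch]
    if shallow_clone:
        # --depth limits history to the most recent N commits.
        cmd += ["--depth", str(shallow_depth)]
    cmd += [repo_url, dest]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.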
GoCPG Options¶
| Parameter | Default | Description |
|---|---|---|
| gocpg_memory_gb | 16 | Memory for GoCPG (GB) |
| batch_size | 10000 | Batch size for DuckDB export |
| use_docker | false | Use Docker for GoCPG |
| docker_image | codegraph/gocpg:latest | GoCPG Docker image |
Documentation Options¶
| Parameter | Default | Description |
|---|---|---|
| import_docs | true | Import documentation |
| import_readme | true | Index README files |
| import_comments | true | Import code comments |
Import Result¶
After successful import, the following are created:
workspace/
├── postgres/ # Source code
└── postgres.duckdb # DuckDB database (CPG + graph)
chroma_db/
└── postgres_documentation/ # ChromaDB collection
src/domains/
└── postgres/ # Domain Plugin
    ├── __init__.py
    ├── plugin.py
    ├── subsystems.yaml
    └── prompts.yaml
Result Structure (ProjectImportResult)¶
{
"cpg_path": "./workspace/postgres.cpg",
"duckdb_path": "./workspace/postgres.duckdb",
"domain_plugin_path": "./src/domains/postgres",
"chromadb_collection": "postgres_documentation",
"chromadb_stats": {
"readme_indexed": 45,
"docs_indexed": 230,
"comments_indexed": 1500
},
"cpg_stats": {
"methods": 125000,
"calls": 450000,
"identifiers": 890000
},
"source_info": {
"files_imported": 6307,
"files_skipped_size": 12,
"total_size_mb": 84.95
},
"validation_report": {
"status": "passed",
"quality_score": 85
},
"detected_language": "c",
"import_duration_seconds": 3600.5
}
CPG Validation¶
Quality Score (0-100)¶
Quality assessment of the imported CPG:
| Criterion | Points |
|---|---|
| Methods found | +50 |
| Files linked to methods (>50%) | +20 |
| AST edges present | +8 |
| CFG edges present | +7 |
| No validation errors | +15 |
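The scoring table can be read as a simple additive rule. The sketch below is an interpretation, not the validator's actual code: `quality_score` is a hypothetical function, and it assumes the ">50%" criterion refers to the share of methods that are linked to a file.

```python
def quality_score(methods, methods_with_files, has_ast_edges, has_cfg_edges, validation_errors):
    """Add up the points from the Quality Score table (maximum 100)."""
    score = 0
    if methods > 0:
        score += 50  # Methods found
        # Assumption: ">50%" means more than half of the methods have a file link.
        if methods_with_files / methods > 0.5:
            score += 20
    if has_ast_edges:
        score += 8
    if has_cfg_edges:
        score += 7
    if not validation_errors:
        score += 15
    return score
```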
Checked Metrics¶
- methods_exist - number of methods
- calls_exist - number of calls
- edges_ast - AST edges
- edges_cfg - CFG edges
- methods_with_files - methods linked to files
Source Code Import¶
The SourceContentStep imports full source code content into nodes_file.content for code navigation and analysis.
How It Works¶
- Reads files from source_path specified in project configuration
- Populates nodes_file.content with full file contents
- Automatically normalizes file paths for JOIN compatibility with nodes_method
- Detects programming language from file extension
Supported File Extensions¶
| Language | Extensions |
|---|---|
| C/C++ | .c, .h, .cpp, .hpp, .cc, .cxx |
| Python | .py, .pyw |
| Java | .java |
| JavaScript/TypeScript | .js, .jsx, .ts, .tsx |
| Go | .go |
| Rust | .rs |
| PHP | .php |
| C# | .cs |
| Kotlin | .kt, .kts |
| 1C:Enterprise | .bsl, .os |
| Scala | .scala |
| SQL | .sql |
| Shell | .sh, .bash |
| Config | .yaml, .yml, .json, .xml, .toml, .ini |
File Size Limit¶
Files larger than 500 KB are skipped to keep database size manageable. This limit covers most source files while excluding large generated or binary files.
Path Normalization¶
File paths in nodes_file.name are automatically normalized to match nodes_method.filename format. Common prefixes like src/ are stripped to enable direct JOINs:
-- Get method source code by line number
SELECT
m.full_name,
m.line_number,
m.line_number_end,
f.content
FROM nodes_method m
JOIN nodes_file f ON REPLACE(m.filename, '/', '\') = REPLACE(f.name, '/', '\')
WHERE m.full_name = 'exec_simple_query';
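The query returns the whole file in f.content together with the method's line range, so the caller still has to cut out the method body. A minimal sketch of that step follows; `method_source` is a hypothetical helper, and it assumes line_number and line_number_end are 1-based and inclusive.

```python
def method_source(content, line_number, line_number_end):
    """Return the lines of content from line_number to line_number_end.

    Assumption: both bounds are 1-based and inclusive.
    """
    lines = content.splitlines()
    return "\n".join(lines[line_number - 1:line_number_end])
```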
Import Statistics¶
After import, the following statistics are available:
| Metric | Description |
|---|---|
| source_files_imported | Number of files successfully imported |
| source_files_skipped_size | Files skipped due to size limit |
| source_files_skipped_not_found | Files not found in source path |
| source_files_total | Total files processed |
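To show how these counters relate to the 500 KB size limit, here is a small sketch that tallies them for a list of files. `tally_import` is a hypothetical helper rather than the SourceContentStep implementation, and it assumes 1 KB = 1024 bytes and that a file exactly at the limit is still imported.

```python
MAX_FILE_SIZE_BYTES = 500 * 1024  # assumption: 500 KB with 1 KB = 1024 bytes


def tally_import(files):
    """Count outcomes for (path, size_in_bytes) pairs; size None means not found."""
    stats = {
        "source_files_imported": 0,
        "source_files_skipped_size": 0,
        "source_files_skipped_not_found": 0,
        "source_files_total": 0,
    }
    for _path, size in files:
        stats["source_files_total"] += 1
        if size is None:
            stats["source_files_skipped_not_found"] += 1
        elif size > MAX_FILE_SIZE_BYTES:
            stats["source_files_skipped_size"] += 1
        else:
            stats["source_files_imported"] += 1
    return stats
```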
Domain Plugin¶
A plugin is automatically generated for working with the new project.
Plugin Structure¶
# src/domains/my_project/plugin.py
class MyProjectPlugin(DomainPlugin):
    @property
    def name(self) -> str:
        return "my_project"

    @property
    def display_name(self) -> str:
        return "My Project"

    def _load_subsystems(self) -> Dict[str, SubsystemInfo]:
        # Load from subsystems.yaml
        ...

    def get_vulnerability_function_mappings(self) -> Dict[str, List[str]]:
        return {
            "buffer_overflow": ["strcpy", "memcpy", ...],
            "sql_injection": [...],
            ...
        }
Configuration: subsystems.yaml¶
subsystems:
  core:
    description: "Core application logic"
    key_functions:
      - main
      - init
      - start
    patterns:
      - "src"
      - "lib"
    related_files: []
  utils:
    description: "Utility functions"
    key_functions: []
    patterns:
      - "util"
      - "helper"
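One way to picture how the patterns entries could be used is to assign a file to every subsystem whose pattern occurs in its path. Substring matching is an assumption made for this sketch; the plugin may use globbing or another rule, and `match_subsystems` is a hypothetical helper.

```python
# Parsed equivalent of the subsystems.yaml example (patterns only).
SUBSYSTEMS = {
    "core": {"patterns": ["src", "lib"]},
    "utils": {"patterns": ["util", "helper"]},
}


def match_subsystems(file_path, subsystems):
    """Return names of subsystems with at least one pattern found in file_path."""
    return [
        name
        for name, info in subsystems.items()
        # Assumption: a pattern matches if it is a substring of the path.
        if any(pattern in file_path for pattern in info.get("patterns", []))
    ]
```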
Configuration: prompts.yaml¶
prompts:
  onboarding:
    system: |
      You are a My Project expert helping developers understand the codebase.
    user_template: |
      Help me understand the following aspect: {query}
  security:
    system: |
      You are a security expert analyzing My Project (C) code.
    user_template: |
      Analyze the following code for security vulnerabilities:
      {code}
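The {query} and {code} placeholders follow Python's str.format syntax, so a template can be filled as sketched below. `render_prompt` is a hypothetical helper shown for illustration; how the plugin itself renders prompts is not documented here.

```python
# Parsed equivalent of one entry from the prompts.yaml example.
PROMPTS = {
    "onboarding": {
        "system": "You are a My Project expert helping developers understand the codebase.\n",
        "user_template": "Help me understand the following aspect: {query}\n",
    },
}


def render_prompt(prompts, scenario, **values):
    """Return (system_prompt, user_prompt) with the placeholders filled in."""
    entry = prompts[scenario]
    system = entry["system"].strip()
    user = entry["user_template"].format(**values).strip()
    return system, user
```

Only the template is parsed by str.format; braces inside the inserted values (for example C code passed as code) are kept verbatim.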
Activating Domain Plugin¶
After creating the plugin, add it to the configuration:
# config.yaml
domains:
  active: "my_project"
  available:
    - postgresql_v2
    - my_project
Or programmatically:
from src.domains import DomainRegistry
DomainRegistry.activate("my_project")
Handling Large Repositories¶
Large C/C++ Projects¶
Note: Large C/C++ projects use the generic_cpp domain plugin for analysis.
# Use shallow clone
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--shallow \
--depth 1
# Or selective import
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--include src/backend/executor \
--mode selective
# Increase memory for GoCPG
python -m src.cli.import_commands full \
--repo https://github.com/postgres/postgres \
--memory 32
Recommendations¶
- Use shallow clone to save space and time
- Select needed directories via --include
- Exclude tests via --exclude test tests
- Increase GoCPG memory for large projects (16-32GB)
Python API¶
import asyncio

from src.project_import import (
    ProjectImportPipeline,
    ProjectImportRequest,
    SupportedLanguage,
    ImportMode,
)

# Create request
request = ProjectImportRequest(
    repo_url="https://github.com/example/project",
    branch="main",
    shallow_clone=True,
    language=SupportedLanguage.JAVA,  # or None for auto-detection
    mode=ImportMode.FULL,
    include_paths=["src/main"],
    exclude_paths=["src/test"],
    create_domain_plugin=True,
    import_docs=True,
)

# Run pipeline
async def run_import():
    def progress_callback(status):
        print(f"Progress: {status.overall_progress}% - {status.current_step}")

    pipeline = ProjectImportPipeline(progress_callback=progress_callback)
    result = await pipeline.run(request)
    print(f"CPG: {result.cpg_path}")
    print(f"DuckDB: {result.duckdb_path}")
    print(f"Language: {result.detected_language}")
    print(f"Quality Score: {result.validation_report['quality_score']}")

asyncio.run(run_import())
Running Individual Steps¶
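import asyncio
from pathlib import Path

from src.project_import import ProjectImportRequest
from src.project_import.pipeline import ProjectImportPipeline

async def run_validation():
    pipeline = ProjectImportPipeline()
    # Step context
    context = {
        "request": ProjectImportRequest(),
        "source_path": Path("./workspace/project"),
        "duckdb_path": "./workspace/project.duckdb",
    }
    # Run validation step (run_step is a coroutine, so it must be awaited)
    result = await pipeline.run_step("validate", context)
    print(result["validation_report"])

asyncio.run(run_validation())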
Troubleshooting¶
GoCPG Frontend Not Found¶
RuntimeError: Frontend not found at expected paths
Solution: Check GOCPG_PATH environment variable or specify explicitly:
export GOCPG_PATH=/path/to/gocpg
python -m src.cli.import_commands full --repo ...
GoCPG Process Failure¶
Error: GoCPG binary exits with non-zero code
Solution:
- Check available disk space for DuckDB output
- Verify source path is accessible: ls <source_path>
- Run with verbose logging: gocpg parse --input=<path> --output=<db> --lang=c -v
- Increase memory allocation if processing a large codebase: --memory 32
Language Not Detected¶
ValueError: No supported source files found
Solution: Specify language explicitly:
python -m src.cli.import_commands full --repo ... --language java
CPG Validation Failed¶
Validation errors: ['methods_exist: expected >= 1, got 0']
Solution: Check:
1. Source code path is correct
2. GoCPG frontend matches the language
3. Files are not excluded by patterns
Configuration (config.yaml)¶
Settings for the project_import module in config.yaml:
project_import:
  gocpg:
    # Path to local GoCPG binary (optional if using Docker)
    binary_path: ${GOCPG_PATH:-gocpg/gocpg.exe}
    # Parse timeout in seconds
    parse_timeout: 3600
    # Update timeout in seconds
    update_timeout: 600
    # Use Docker instead of local GoCPG
    use_docker: false
    # Docker image for GoCPG
    docker_image: "codegraph/gocpg:latest"
    # Memory limit (GB)
    memory_gb: 16
  # Directory for DuckDB files
  duckdb_path: ./data/projects
  # Batch size for DuckDB export
  batch_size: 10000
  # Auto-detect incremental mode
  auto_detect_incremental: true
  # Default exclusion patterns
  default_excludes:
    - "node_modules"
    - "venv"
    - ".venv"
    - "__pycache__"
    - ".git"
    - "test"
    - "tests"
    - "vendor"
    - "third_party"
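The binary_path value uses the shell-style ${NAME:-default} form: the environment variable wins when it is set and non-empty, otherwise the default applies. The sketch below shows how such a value could be resolved. `expand_env` is a hypothetical helper; the configuration loader's real implementation is not shown in this guide.

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(value, environ=None):
    """Expand ${NAME:-default} placeholders using environ (os.environ by default)."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name, default = match.group(1), match.group(2) or ""
        # Shell ':-' semantics: fall back when the variable is unset or empty.
        return environ.get(name) or default

    return _ENV_PATTERN.sub(replace, value)
```

Passing an explicit environ mapping keeps the function easy to test without touching the real environment.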
Component Architecture¶
ProjectRegistry¶
Project registry backed by PostgreSQL. Takes an AsyncSession in the constructor:
from sqlalchemy.ext.asyncio import AsyncSession
from src.project_import.registry import ProjectRegistry
# Used with an existing SQLAlchemy async session
registry = ProjectRegistry(session)
# CRUD operations
projects = await registry.list_projects()
project = await registry.get_project_by_name("my_project")
await registry.rename_project(project.id, "new_name")
await registry.delete_project(project.id, delete_files=True)
GoCPGClient¶
The GoCPGClient provides a unified async Python wrapper for all GoCPG commands with Pydantic result models:
from src.services.gocpg import GoCPGClient
client = GoCPGClient() # auto-detects binary path from config.yaml
result = await client.parse(input_path="/src", output_path="data/projects/postgres.duckdb", language="c")
result = await client.update(input_path="/src", output_path="data/projects/postgres.duckdb", force=True)
ci_result = await client.ci_update(input_path="/src", output_path="data/projects/postgres.duckdb", base_ref="origin/main")
stats = await client.stats()
See src/services/gocpg/ for the full client API (31 async methods, 28 Pydantic models).
See Also¶
- REST API Documentation - HTTP API endpoints
- API Reference - Python API
- Scenarios Guide - Analysis scenarios