DLP: Data Loss Prevention¶

Technical Documentation for Security and Compliance Teams

Table of Contents¶

Overview
Key Capabilities
Architecture
Processing Pipeline
Detection Categories
1. Credentials — HIGH Severity
2. PII (Personally Identifiable Information) — MEDIUM Severity
3. Source Code — LOW Severity
DLP Actions
Priority Hierarchy
Action Selection Logic
Configuration
Basic Configuration (config.yaml)
Adding Custom Patterns
API Reference
ContentScanner
ScanResult
DLPMatch
SIEM Integration
DLP Events
Event Types
Webhook Integration
External DLP Request Format
Best Practices
For Configuration
For Developers
For Compliance
Metrics and Monitoring
Prometheus Metrics
Grafana Dashboard
Related Documents

Overview¶

The DLP (Data Loss Prevention) module in CodeGraph protects against sensitive data leakage when working with LLMs. The system scans both incoming user requests and outgoing LLM responses.

Key Capabilities¶

25+ patterns for sensitive data detection
Pre-request scanning — blocking before sending to LLM
Post-response scanning — masking in LLM responses
4 action modes: BLOCK, MASK, WARN, LOG_ONLY
SIEM integration — real-time event dispatch
Webhook notifications — integration with external DLP systems

Architecture¶

Processing Pipeline¶

┌─────────────────────────────────────────────────────────────────────────┐
│                              USER                                        │
│                              │                                          │
│                              ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │                    PRE-REQUEST SCANNER                             │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │ │
│  │  │ Credentials │  │    PII      │  │ Source Code │               │ │
│  │  │   (HIGH)    │  │  (MEDIUM)   │  │    (LOW)    │               │ │
│  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘               │ │
│  │         │                │                │                       │ │
│  │         └────────────────┴────────────────┘                       │ │
│  │                          │                                        │ │
│  │              ┌───────────┴───────────┐                           │ │
│  │              ▼           ▼           ▼                           │ │
│  │          [BLOCK]     [MASK]     [WARN/LOG]                       │ │
│  │              │           │           │                           │ │
│  │              │     ┌─────┴─────┐     │                           │ │
│  │              │     │Replace w/ │     │                           │ │
│  │              │     │[REDACTED] │     │                           │ │
│  │              │     └─────┬─────┘     │                           │ │
│  └──────────────┼───────────┼───────────┼────────────────────────────┘ │
│                 │           │           │                              │
│                 │      ┌────┴────┐      │                              │
│                 │      ▼         ▼      │                              │
│           SIEM Event  GigaChat/Yandex   │                              │
│                               │         │                              │
│                               ▼         │                              │
│  ┌───────────────────────────────────────────────────────────────────┐ │
│  │                   POST-RESPONSE SCANNER                           │ │
│  │                          │                                        │ │
│  │              Mask sensitive data in                               │ │
│  │              LLM response                                         │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                              │                                          │
│                              ▼                                          │
│                       USER RESPONSE                                     │
└─────────────────────────────────────────────────────────────────────────┘

Detection Categories¶

1. Credentials — HIGH Severity¶

Pattern	Regex	Description
`api_key_generic`	`(?i)(api[_-]?key|apikey)["\s:=]+["\']?([a-zA-Z0-9_\-]{20,})["\']?`	Generic API key
`aws_access_key`	`AKIA[0-9A-Z]{16}`	AWS Access Key ID
`aws_secret_key`	`(?i)aws[_\s]secret[_\s]access[_\s]*key["\s:=]+["\']?([a-zA-Z0-9/+=]{40})["\']?`	AWS Secret Access Key
`private_key`	`-----BEGIN (RSA |EC |OPENSSH |DSA )?PRIVATE KEY-----`	Private key (RSA, EC, OpenSSH, DSA)
`password_pattern`	`(?i)(password|passwd|pwd)["\s:=]+["\']?([^\s"\']{8,})["\']?`	Password in config/code
`bearer_token`	`(?i)bearer\s+[a-zA-Z0-9\-_]+\.[a-zA-Z0-9\-_]+\.[a-zA-Z0-9\-_]+`	JWT Bearer token
`github_token`	`gh[pousr]_[A-Za-z0-9_]{36,}`	GitHub token (personal access, OAuth, app, or refresh)

Default Action: BLOCK
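
To illustrate how these patterns behave, here is a minimal sketch that compiles three of the regexes from the table and reports which ones fire on a string. The `find_credentials` helper and the `CREDENTIAL_PATTERNS` dictionary are hypothetical names used only for this example; they are not part of the CodeGraph API. The sample value is the example access key ID used in AWS documentation, not a live credential.

```python
import re

# Regexes transcribed from the credentials table (three of the seven patterns).
CREDENTIAL_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"gh[pousr]_[A-Za-z0-9_]{36,}"),
    "private_key": re.compile(
        r"-----BEGIN (RSA |EC |OPENSSH |DSA )?PRIVATE KEY-----"
    ),
}


def find_credentials(text: str) -> list[str]:
    """Return the names of the credential patterns that occur in `text`."""
    return [name for name, rx in CREDENTIAL_PATTERNS.items() if rx.search(text)]


# Hypothetical prompt fragment containing an example (non-live) key ID.
sample = "export AWS_KEY=AKIAIOSFODNN7EXAMPLE"
hits = find_credentials(sample)

# Credentials default to BLOCK, so any hit means the request is rejected.
should_block = bool(hits)
```

Here `hits` contains only `aws_access_key`, so `should_block` is true and the prompt would never reach the LLM.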

2. PII (Personally Identifiable Information) — MEDIUM Severity¶

Pattern	Regex	Mask	Description
`email`	`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`	`[EMAIL]`	Email address
`phone_ru`	`(\+7|8)?[\s\-]?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{2}[\s\-]?\d{2}`	`[PHONE]`	Russian phone
`phone_us`	`(\+1)?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}`	`[PHONE]`	US phone
`ssn`	`\b\d{3}-\d{2}-\d{4}\b`	`[SSN]`	US SSN
`credit_card`	`\b(?:\d{4}[\s-]?){3}\d{4}\b`	`[CARD]`	Credit card
`ip_address`	`\b(?:\d{1,3}\.){3}\d{1,3}\b`	`[IP]`	IPv4 address
`passport_ru`	`\b\d{2}\s?\d{2}\s?\d{6}\b`	`[PASSPORT]`	Russian passport

Default Action: MASK
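
The masking behavior can be sketched with plain `re.sub` calls. This example uses three regex/mask pairs from the table; `PII_PATTERNS` and `mask_pii` are hypothetical names for illustration, and the real module may apply patterns in a different order or handle overlaps differently.

```python
import re

# (regex, mask) pairs transcribed from the PII table.
PII_PATTERNS = [
    (re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def mask_pii(text: str) -> str:
    """Replace every PII match with the mask configured for its pattern."""
    for rx, mask in PII_PATTERNS:
        text = rx.sub(mask, text)
    return text


masked = mask_pii("Mail jane.doe@example.com from 10.0.0.50, SSN 123-45-6789")
# masked == "Mail [EMAIL] from [IP], SSN [SSN]"

# The IPv4 regex is purely syntactic, so a dotted version number also matches.
false_positive = mask_pii("upgrade to 1.2.3.4")
# false_positive == "upgrade to [IP]"
```

The second call shows why it helps to measure false positives before tightening actions: `ip_address` cannot distinguish an address from a four-part version string.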

3. Source Code — LOW Severity¶

Pattern	Regex	Mask	Description
`connection_string`	`(?i)(jdbc|mysql|postgresql|mongodb|redis)://[^\s"\'<>]+`	`[CONN_STRING]`	DB connection string
`internal_path_unix`	`(/home/|/var/|/etc/|/opt/)[^\s"\'<>|]+`	`[PATH]`	Unix path
`internal_path_windows`	`[A-Z]:\\(Users|Windows|Program)[^\s"\'<>|]*`	`[PATH]`	Windows path

Default Action: WARN

DLP Actions¶

Priority Hierarchy¶

Priority	Action	Description	Result
4 (max)	`BLOCK`	Complete blocking	Request rejected, error to client
3	`MASK`	Masking	Data replaced with `[REDACTED]`
2	`WARN`	Warning	Request allowed, SIEM event
1	`LOG_ONLY`	Logging only	Audit log entry

Action Selection Logic¶

# With multiple matches, highest priority is selected
action_priority = {
    DLPAction.BLOCK: 4,
    DLPAction.MASK: 3,
    DLPAction.WARN: 2,
    DLPAction.LOG_ONLY: 1,
}

# Example: found API key (BLOCK) and email (MASK)
# Result: BLOCK (priority 4 > 3)
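
The selection rule can be written out as a self-contained sketch. The `DLPAction` enum below is a stand-in for the one in `src.security.config`, and `select_action` is a hypothetical helper name; the actual implementation may differ.

```python
from enum import Enum


class DLPAction(Enum):
    """Stand-in for src.security.config.DLPAction (assumed member names)."""
    BLOCK = "BLOCK"
    MASK = "MASK"
    WARN = "WARN"
    LOG_ONLY = "LOG_ONLY"


ACTION_PRIORITY = {
    DLPAction.BLOCK: 4,
    DLPAction.MASK: 3,
    DLPAction.WARN: 2,
    DLPAction.LOG_ONLY: 1,
}


def select_action(actions):
    """Return the highest-priority action, or None if nothing matched."""
    return max(actions, key=ACTION_PRIORITY.__getitem__, default=None)


# An API key (BLOCK) and an email (MASK) in the same prompt resolve to BLOCK.
chosen = select_action([DLPAction.MASK, DLPAction.BLOCK])
```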

Configuration¶

Basic Configuration (config.yaml)¶

security:
  enabled: true

  dlp:
    enabled: true

    # Pre-request scanning (before sending to LLM)
    pre_request:
      enabled: true
      default_action: WARN

    # Post-response scanning (after receiving from LLM)
    post_response:
      enabled: true
      default_action: MASK

    # Categories and patterns
    categories:
      credentials:
        enabled: true
        action: BLOCK
        patterns: []  # Uses default patterns

      pii:
        enabled: true
        action: MASK
        patterns:
          - name: custom_ssn_pattern
            regex: '\b\d{3}-\d{2}-\d{4}\b'
            mask_with: '[SSN]'
            description: 'Social Security Number'

      source_code:
        enabled: true
        action: WARN
        patterns: []

    # Custom keywords
    keywords:
      sensitive_terms:
        words:
          - "confidential"
          - "top secret"
          - "internal use only"
        case_sensitive: false

    keywords_action: LOG_ONLY

    # Webhook for external DLP system
    webhook:
      enabled: false
      endpoint: "https://dlp.company.com/api/scan"
      auth_header: "Bearer ${DLP_WEBHOOK_TOKEN}"
      timeout_seconds: 10
      notify_on:
        - BLOCK
        - WARN
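
The `auth_header` value contains a `${DLP_WEBHOOK_TOKEN}` placeholder. How CodeGraph resolves it is not shown here; the sketch below is one possible approach using the standard library's `string.Template`, with a hypothetical `expand_env` helper. It fails loudly when the variable is unset, so the literal placeholder is never sent as a token.

```python
import os
from string import Template


def expand_env(value: str, env=None) -> str:
    """Expand ${VAR} placeholders from `env` (defaults to os.environ).

    Raises ValueError if a referenced variable is not set.
    """
    env = os.environ if env is None else env
    try:
        return Template(value).substitute(env)
    except KeyError as exc:
        raise ValueError(f"environment variable {exc.args[0]} is not set") from None


# Example with an explicit mapping instead of the real environment.
header = expand_env(
    "Bearer ${DLP_WEBHOOK_TOKEN}", {"DLP_WEBHOOK_TOKEN": "example-token"}
)
```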

Adding Custom Patterns¶

from src.security.dlp.patterns import PatternRegistry
from src.security.config import DLPAction

# `config` is assumed to be the already-loaded DLP configuration
registry = PatternRegistry(config)

# Add custom pattern
registry.add_pattern(
    category="pii",
    name="employee_id",
    regex=r'EMP-\d{6}',
    action=DLPAction.MASK,
    mask_with="[EMPLOYEE_ID]"
)

# Add keywords
registry.add_keywords(
    list_name="internal_projects",
    keywords=["Project Alpha", "Codename Beta"],
    case_sensitive=False
)
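
Before registering a custom pattern, it is worth checking it against positive and negative samples. The sketch below does this for the `employee_id` regex with the standard `re` module only. The stricter variant with `\b` anchors is a suggestion for comparison, not something the module requires.

```python
import re

EMPLOYEE_ID = re.compile(r"EMP-\d{6}")             # regex from the example above
EMPLOYEE_ID_STRICT = re.compile(r"\bEMP-\d{6}\b")  # hypothetical stricter variant


def mask(rx: re.Pattern, text: str) -> str:
    """Replace every match of `rx` with the employee ID mask."""
    return rx.sub("[EMPLOYEE_ID]", text)


# A well-formed ID is masked as expected.
ok = mask(EMPLOYEE_ID, "ticket from EMP-123456")

# Without anchors, the first six digits of a longer number also match,
# leaving a stray digit behind the mask.
leaky = mask(EMPLOYEE_ID, "order EMP-1234567")

# With word boundaries, the seven-digit value is left untouched.
strict = mask(EMPLOYEE_ID_STRICT, "order EMP-1234567")
```

Here `ok` is `ticket from [EMPLOYEE_ID]`, `leaky` is `order [EMPLOYEE_ID]7`, and `strict` is the unchanged input.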

API Reference¶

ContentScanner¶

from src.security.dlp.scanner import ContentScanner, DLPBlockedException
from src.security.config import DLPAction, get_security_config

# Initialize the scanner once from the security configuration
config = get_security_config()
scanner = ContentScanner(config.dlp)


# The wrapper name and its parameters are illustrative
async def handle_prompt(user_prompt, llm_client):
    # Pre-request scanning
    result = scanner.scan_request(user_prompt)

    if result.blocked:
        # Request blocked: the prompt is never sent to the LLM
        raise DLPBlockedException(result.matches)
    elif result.action == DLPAction.MASK:
        # Use the masked content instead of the original prompt
        user_prompt = result.modified_content

    # Send to LLM...
    llm_response = await llm_client.complete(user_prompt)

    # Post-response scanning
    result = scanner.scan_response(llm_response)
    if result.has_matches:
        # Return the masked response
        llm_response = result.modified_content

    return llm_response

ScanResult¶

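from dataclasses import dataclass
from typing import List

from src.security.config import DLPAction

@dataclass
class ScanResult: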
    has_matches: bool           # Whether matches were found
    matches: List[DLPMatch]     # List of matches
    action: DLPAction           # Recommended action
    modified_content: str       # Masked content
    blocked: bool               # Whether to block request

DLPMatch¶

@dataclass
class DLPMatch:
    category: str          # Category (credentials, pii, source_code)
    pattern_name: str      # Pattern name
    match_type: MatchType  # REGEX or KEYWORD
    matched_text: str      # Found text
    start: int             # Start position
    end: int               # End position
    action: DLPAction      # Action for this match
    mask_with: str         # Replacement text
    severity: str          # Severity (critical, high, medium, low)
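
The `start` and `end` offsets make it possible to mask the original text in place. The sketch below shows one way to do that, using a reduced stand-in for `DLPMatch` and two patterns from the PII table. `Match`, `find_matches`, and `apply_masks` are hypothetical names, and overlapping matches are not handled.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    """Reduced stand-in for DLPMatch with only the fields used here."""
    pattern_name: str
    start: int
    end: int
    mask_with: str


PATTERNS = {
    "email": (re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"), "[EMAIL]"),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
}


def find_matches(text: str) -> list[Match]:
    """Collect all matches, ordered by their position in the text."""
    found = []
    for name, (rx, mask) in PATTERNS.items():
        for m in rx.finditer(text):
            found.append(Match(name, m.start(), m.end(), mask))
    return sorted(found, key=lambda match: match.start)


def apply_masks(text: str, matches: list[Match]) -> str:
    """Replace matches from right to left so earlier offsets stay valid."""
    for match in sorted(matches, key=lambda m: m.start, reverse=True):
        text = text[:match.start] + match.mask_with + text[match.end:]
    return text


text = "SSN 123-45-6789, contact a@b.io"
matches = find_matches(text)
masked = apply_masks(text, matches)
```

Replacing from the end of the string backwards matters because each mask has a different length than the text it replaces; going left to right would shift every later offset.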

SIEM Integration¶

DLP Events¶

Each DLP trigger sends an event to SIEM:

{
  "event_type": "DLP_BLOCK",
  "timestamp": "2026-02-26T10:30:00.000Z",
  "severity": "CRITICAL",
  "user_id": "user_123",
  "session_id": "sess_456",
  "category": "credentials",
  "pattern_name": "aws_access_key",
  "action_taken": "BLOCK",
  "match_count": 1,
  "masked_preview": "Found AWS key: AKIA...",
  "request_path": "/api/v1/scenarios/execute",
  "ip_address": "10.0.0.50"
}

Event Types¶

Event	Severity	Description
`DLP_BLOCK`	CRITICAL	Request blocked
`DLP_MASK`	WARNING	Data masked
`DLP_WARN`	WARNING	Warning issued
`DLP_LOG`	INFO	Logging only
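
As a rough illustration, an event like the one above could be assembled as follows. The mapping from action to event type is inferred from the two tables in this document, and `build_event` is a hypothetical helper. Request-specific fields such as `request_path`, `ip_address`, and `masked_preview` are omitted for brevity.

```python
from datetime import datetime, timezone

# Action -> (event type, severity), inferred from the Event Types table.
EVENT_FOR_ACTION = {
    "BLOCK": ("DLP_BLOCK", "CRITICAL"),
    "MASK": ("DLP_MASK", "WARNING"),
    "WARN": ("DLP_WARN", "WARNING"),
    "LOG_ONLY": ("DLP_LOG", "INFO"),
}


def build_event(action, category, pattern_name, match_count, user_id, session_id):
    """Build a SIEM event dictionary for one DLP trigger."""
    event_type, severity = EVENT_FOR_ACTION[action]
    # UTC timestamp with millisecond precision and a trailing "Z".
    timestamp = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "event_type": event_type,
        "timestamp": timestamp,
        "severity": severity,
        "user_id": user_id,
        "session_id": session_id,
        "category": category,
        "pattern_name": pattern_name,
        "action_taken": action,
        "match_count": match_count,
    }


event = build_event("BLOCK", "credentials", "aws_access_key", 1, "user_123", "sess_456")
```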

Webhook Integration¶

External DLP Request Format¶

POST /api/scan HTTP/1.1
Host: dlp.company.com
Authorization: Bearer xxx
Content-Type: application/json

{
  "scan_id": "scan_789",
  "timestamp": "2026-02-26T10:30:00.000Z",
  "content_type": "request",
  "user_id": "user_123",
  "matches": [
    {
      "category": "credentials",
      "pattern_name": "aws_access_key",
      "severity": "high",
      "action": "BLOCK"
    }
  ],
  "action_taken": "BLOCK"
}
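
The request above can be constructed with the Python standard library alone. This is a sketch, not the module's actual client: `build_webhook_request` is a hypothetical helper, and the token is a placeholder. The request is only built here, not sent.

```python
import json
import urllib.request


def build_webhook_request(endpoint, auth_header, payload):
    """Build (but do not send) the POST request for the external DLP system."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": auth_header,
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "scan_id": "scan_789",
    "timestamp": "2026-02-26T10:30:00.000Z",
    "content_type": "request",
    "user_id": "user_123",
    "matches": [
        {
            "category": "credentials",
            "pattern_name": "aws_access_key",
            "severity": "high",
            "action": "BLOCK",
        }
    ],
    "action_taken": "BLOCK",
}

request = build_webhook_request(
    "https://dlp.company.com/api/scan", "Bearer example-token", payload
)
# To send it: urllib.request.urlopen(request, timeout=10)
```

Note that the payload carries only match metadata (category, pattern name, severity, action), so the sensitive text itself does not leave the system.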

Best Practices¶

For Configuration¶

Start with WARN — set all categories to WARN to assess false positives
Gradually tighten — move critical categories to BLOCK after testing
Configure exceptions — add whitelist for known safe patterns
Monitor SIEM — track trigger statistics

For Developers¶

Don’t log matched_text in production without masking
Handle DLPBlockedException in API — return clear error message
Test patterns — use regex101.com for validation
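
The first two points can be combined in an error handler that tells the client what kind of data was detected without echoing the data itself. The classes below are stand-ins for `DLPMatch` and `DLPBlockedException`, and the response shape is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Reduced stand-in for DLPMatch."""
    category: str
    pattern_name: str
    matched_text: str


class DLPBlockedException(Exception):
    """Stand-in for the exception raised when a request is blocked."""

    def __init__(self, matches):
        super().__init__("request blocked by DLP policy")
        self.matches = matches


def blocked_response(exc: DLPBlockedException) -> dict:
    """Build an error body that names the detections but never the matched text."""
    detections = sorted({(m.category, m.pattern_name) for m in exc.matches})
    return {
        "error": "dlp_blocked",
        "message": "Request blocked: sensitive data was detected in the input.",
        "detections": [{"category": c, "pattern": p} for c, p in detections],
    }


try:
    raise DLPBlockedException(
        [Match("credentials", "aws_access_key", "AKIAIOSFODNN7EXAMPLE")]
    )
except DLPBlockedException as exc:
    response = blocked_response(exc)
```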

For Compliance¶

Document patterns — maintain registry of patterns in use
Regular audit — verify detection effectiveness
GDPR compliance — ensure PII patterns cover required data types

Metrics and Monitoring¶

Prometheus Metrics¶

# Block rate (blocks per second, averaged over 5 minutes)
rate(dlp_blocks_total[5m])

# Mask rate (masks per second, averaged over 5 minutes)
rate(dlp_masks_total[5m])

# Distribution by category
dlp_matches_total{category="credentials"}
dlp_matches_total{category="pii"}

# 95th-percentile scan time
histogram_quantile(0.95, rate(dlp_scan_duration_seconds_bucket[5m]))

Grafana Dashboard¶

{
  "panels": [
    {
      "title": "DLP Blocks per Minute",
      "query": "rate(dlp_blocks_total[1m]) * 60"
    },
    {
      "title": "Top Triggered Patterns",
      "query": "topk(10, dlp_matches_total)"
    }
  ]
}

Related Documents¶

Enterprise Security Brief (security overview)
SIEM Integration
LLM Security

Version: 1.1 | February 2026
