How It Works

Overview

KafkaCode uses a dual-layer analysis approach combining pattern-based detection with AI-powered contextual analysis to identify privacy and compliance issues in your source code.

Architecture

File Discovery

The FileScanner component recursively scans your project directory, respecting .gitignore rules and excluding common directories like node_modules, .git, and build.

Supported File Types:

Python (.py)
JavaScript (.js)
TypeScript (.ts)
Java (.java)
Go (.go)
Ruby (.rb)
PHP (.php)
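As a minimal sketch, the extension check implied by this list could look like the following. The helper name isSupportedFile matches the one used in the directory traversal example, but its body is an assumption, as is the case-insensitive comparison.

```javascript
const path = require('path');

// Extensions taken from the supported file types list above.
const SUPPORTED_EXTENSIONS = new Set(['.py', '.js', '.ts', '.java', '.go', '.rb', '.php']);

// Assumed implementation: true when the file's extension is in the supported set.
// Lowercasing the extension is an assumption, not documented behavior.
function isSupportedFile(filePath) {
  return SUPPORTED_EXTENSIONS.has(path.extname(filePath).toLowerCase());
}
```

A Set lookup keeps the check constant-time, and adding a language would only mean adding one entry.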

Pattern-Based Analysis

The PatternScanner performs regex-based detection for known privacy issues:

Hardcoded secrets (AWS keys, API tokens)
PII patterns (emails, phone numbers)
Sensitive keywords in assignment contexts
High entropy strings
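The "high entropy strings" check is commonly built on Shannon entropy, which measures how evenly characters are distributed in a string. The sketch below shows that general technique; the function names, the 4.0 bits-per-character threshold, and the minimum length of 20 are illustrative assumptions, not KafkaCode's actual values.

```javascript
// Shannon entropy in bits per character: H = -sum(p(c) * log2(p(c))).
function shannonEntropy(str) {
  const chars = Array.from(str);
  if (chars.length === 0) return 0;

  // Count how often each character occurs.
  const counts = new Map();
  for (const ch of chars) counts.set(ch, (counts.get(ch) || 0) + 1);

  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Hypothetical threshold and minimum length; real values would need tuning.
function looksHighEntropy(str, threshold = 4.0, minLength = 20) {
  return str.length >= minLength && shannonEntropy(str) >= threshold;
}
```

Repetitive strings score close to zero, while random-looking tokens score higher, which is why entropy is a useful hint for secrets that match no known format.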

AI-Powered Analysis

The LLMAnalyzer uses advanced language models to:

Understand code context
Identify subtle privacy issues
Reduce false positives
Provide contextual recommendations

Report Generation

The ReportGenerator compiles findings and produces:

Privacy grade (A+ to F)
Severity classification
Line-by-line details
Actionable recommendations

Dual-Layer Detection

Layer 1: Pattern-Based Detection

Fast, deterministic scanning using regex patterns:

// Example: secret detection patterns (AWS access key, private key, Stripe key)
const patterns = {
  awsAccessKey: /AKIA[0-9A-Z]{16}/g,
  privateKey: /-----BEGIN (RSA |EC )?PRIVATE KEY-----/,
  stripeKey: /sk_live_[0-9a-zA-Z]{24}/
};
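
To illustrate how patterns like these might be applied, here is a sketch that scans content line by line and records where each match occurs. The scanContent helper and the shape of a finding are assumptions for illustration; the patterns are repeated so the block stands alone.

```javascript
// Same three patterns as above, repeated so the sketch is self-contained.
const patterns = {
  awsAccessKey: /AKIA[0-9A-Z]{16}/g,
  privateKey: /-----BEGIN (RSA |EC )?PRIVATE KEY-----/,
  stripeKey: /sk_live_[0-9a-zA-Z]{24}/
};

// Hypothetical helper: returns one finding per match, with 1-based line and column.
function scanContent(content) {
  const findings = [];
  content.split(/\r?\n/).forEach((line, index) => {
    for (const [type, regex] of Object.entries(patterns)) {
      // matchAll requires a global regex; building a fresh one per line
      // also avoids carrying lastIndex state between calls.
      const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
      const globalRegex = new RegExp(regex.source, flags);
      for (const match of line.matchAll(globalRegex)) {
        findings.push({ type, line: index + 1, column: match.index + 1, match: match[0] });
      }
    }
  });
  return findings;
}
```

Creating a fresh regex per line sidesteps a common pitfall: reusing a regex with the g flag in repeated test or exec calls carries lastIndex over and can skip matches.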

Advantages:

⚡ Very fast execution
🎯 High precision for known patterns
📊 Zero external dependencies

Limitations:

May miss context-dependent issues
Can produce false positives
Limited to predefined patterns

Layer 2: AI-Powered Analysis

Contextual analysis using an LLM:

// The LLM analyzes:
const context = {
  codeSnippet: fileContent,
  patternFindings: previousFindings,
  language: fileExtension
};

// Returns contextual insights
const aiFindings = await llm.analyze(context);

Advantages:

🧠 Understands code context
🔍 Finds subtle privacy issues
✨ Reduces false positives
💡 Provides smart recommendations

How it works:

Takes code snippet and pattern findings
Analyzes semantic meaning
Identifies privacy concerns in context
Suggests specific remediation steps
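
The steps above can be pictured as turning the context object into a prompt. The sketch below is purely illustrative: buildPrompt, the prompt wording, and the finding fields (line, type) are assumptions, since the actual prompt used by LLMAnalyzer is not shown in this document.

```javascript
// Hypothetical prompt builder; not the actual LLMAnalyzer prompt.
function buildPrompt({ codeSnippet, patternFindings, language }) {
  // Summarize the pattern-layer findings so the model can confirm or dismiss them.
  const findingLines = patternFindings.length
    ? patternFindings.map(f => `- line ${f.line}: ${f.type}`).join('\n')
    : '- none';

  return [
    `You are reviewing ${language} code for privacy and compliance issues.`,
    'Pattern findings:',
    findingLines,
    'Code:',
    codeSnippet,
    'For each concern, give the line, the severity, and a specific remediation step.'
  ].join('\n');
}
```

Passing the pattern findings along with the code is what would let the second layer judge each finding in context, for example dismissing a key that only appears in a test fixture.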

Severity Classification

Issues are classified into four severity levels:

Critical

Score: 100 points

Exposed API keys
Private keys
Database credentials
OAuth tokens

Action: Fix immediately

High

Score: 50 points

Sensitive data in code
Payment information
Authentication secrets
Personal identifiers

Action: Address soon

Medium

Score: 10 points

Email addresses
Phone numbers
High entropy strings
Potential secrets

Action: Review recommended

Low

Score: 1 point

IP addresses
URLs
Configuration values
Minor issues

Action: Optional review

Privacy Grading Algorithm

The privacy grade is calculated from the total severity score:

function calculateGrade(totalScore) {
  if (totalScore === 0) return 'A+';
  if (totalScore <= 5) return 'A';
  if (totalScore <= 10) return 'A-';
  if (totalScore <= 20) return 'B+';
  if (totalScore <= 30) return 'B';
  if (totalScore <= 50) return 'B-';
  if (totalScore <= 75) return 'C+';
  if (totalScore <= 100) return 'C';
  if (totalScore <= 150) return 'C-';
  if (totalScore <= 200) return 'D';
  return 'F';
}

Score Calculation:

Total Score = Σ(severity_points × issue_count)

Example:

1 Critical issue (100 pts) + 2 Medium issues (20 pts) = 120 points = Grade C-
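
The score calculation can be sketched as a lookup plus a sum. The point values come from the Severity Classification section; the totalScore helper and the finding shape ({ severity }) are assumptions for illustration.

```javascript
// Points per severity level, as listed in the Severity Classification section.
const SEVERITY_POINTS = { critical: 100, high: 50, medium: 10, low: 1 };

// Hypothetical finding shape: { severity: 'critical' | 'high' | 'medium' | 'low' }.
function totalScore(findings) {
  return findings.reduce((sum, finding) => {
    const points = SEVERITY_POINTS[finding.severity];
    if (points === undefined) {
      throw new Error(`Unknown severity: ${finding.severity}`);
    }
    return sum + points;
  }, 0);
}

// The worked example above: 1 Critical + 2 Medium.
const example = [{ severity: 'critical' }, { severity: 'medium' }, { severity: 'medium' }];
const score = totalScore(example); // 100 + 10 + 10
```

Passing this score to calculateGrade falls through to the `totalScore <= 150` branch, which matches the C- in the example.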

File Scanning Process

1. Directory Traversal

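// FileScanner recursively walks the directory tree.
// shouldIgnore and isSupportedFile are assumed to be defined elsewhere.
const fs = require('fs');
const path = require('path');

function scanDirectory(dir, files = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (shouldIgnore(fullPath)) continue;

    if (entry.isDirectory()) {
      scanDirectory(fullPath, files); // Recursive
    } else if (isSupportedFile(fullPath)) {
      files.push(fullPath);
    }
  }
  return files;
}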

2. Ignore Rules

KafkaCode automatically respects:

Built-in Ignores:

.git/
node_modules/
venv/, .venv/, env/
build/, dist/, target/
.next/, .nuxt/
coverage/, __pycache__/

Gitignore Patterns:

Reads and applies .gitignore rules
Uses glob pattern matching
Respects both file and directory patterns
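
A minimal sketch of the built-in part of these rules is shown below. It covers only the fixed directory names from the list above; .gitignore handling is deliberately left out, since it would need a gitignore-compatible glob matcher. The helper name isBuiltInIgnored is an assumption.

```javascript
// Directory names from the built-in ignore list above.
const BUILT_IN_IGNORES = new Set([
  '.git', 'node_modules', 'venv', '.venv', 'env',
  'build', 'dist', 'target', '.next', '.nuxt', 'coverage', '__pycache__'
]);

// Hypothetical helper: true when any segment of a relative path matches a
// built-in ignored name. Splits on both "/" and "\" to handle Windows paths.
function isBuiltInIgnored(relativePath) {
  return relativePath.split(/[\\/]/).some(segment => BUILT_IN_IGNORES.has(segment));
}
```

Matching whole path segments rather than substrings means a file such as src/builder.js is not mistaken for the build/ directory. Note that this simplified check would also skip a plain file that happens to be named build.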

3. Content Analysis

Each file is:

Read into memory
Scanned by PatternScanner
Analyzed by LLMAnalyzer (if patterns found)
Findings aggregated and classified
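
The four steps above could be wired together roughly as follows. This is a sketch under stated assumptions: the method names patternScanner.scan and llmAnalyzer.analyze, the result shape, and the simple concatenation of findings are all hypothetical. Only the context fields (codeSnippet, patternFindings, language) come from the earlier example.

```javascript
// Hypothetical per-file pipeline; scanner and analyzer are passed in so they can be swapped.
async function analyzeFile(filePath, content, { patternScanner, llmAnalyzer }) {
  // Step 2: fast, local pattern scan.
  const patternFindings = patternScanner.scan(content);

  // Step 3 is conditional: skip the LLM when the pattern layer found nothing.
  if (patternFindings.length === 0) {
    return { filePath, findings: [] };
  }

  const aiFindings = await llmAnalyzer.analyze({
    codeSnippet: content,
    patternFindings,
    language: filePath.split('.').pop()
  });

  // Step 4: aggregate. Plain concatenation is an assumption; a real engine
  // might deduplicate or let the AI layer drop false positives.
  return { filePath, findings: [...patternFindings, ...aiFindings] };
}
```

Gating the LLM call on pattern findings is what keeps clean files cheap: they cost one regex pass and no model request.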

Performance Optimizations

Efficient File Scanning

Skips ignored directories early
Uses streaming for large files
Processes files sequentially to manage memory

Smart LLM Usage

Only analyzes files with pattern findings
Batches related issues together
Uses concise prompts to minimize tokens

Caching Strategy

Pattern regex compiled once
File stats cached during traversal
Reuses analysis for unchanged files
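
One way to reuse analysis for unchanged files is to key cached findings by a hash of the file content. The sketch below shows that general technique; the AnalysisCache class and its in-memory Map are assumptions, and this document does not say how KafkaCode detects unchanged files.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory cache: file path -> { hash, findings }.
class AnalysisCache {
  constructor() {
    this.entries = new Map();
  }

  // SHA-256 of the content, used to detect changes.
  static hash(content) {
    return crypto.createHash('sha256').update(content).digest('hex');
  }

  // Returns cached findings when the content is unchanged, otherwise undefined.
  get(filePath, content) {
    const entry = this.entries.get(filePath);
    return entry && entry.hash === AnalysisCache.hash(content) ? entry.findings : undefined;
  }

  set(filePath, content, findings) {
    this.entries.set(filePath, { hash: AnalysisCache.hash(content), findings });
  }
}
```

A caller would check `cache.get(filePath, content)` before scanning and call `cache.set(...)` afterwards. Hashing content rather than comparing modification times avoids re-analysis when a file is rewritten without changes.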

Memory Management

Files processed one at a time
Buffers cleared after analysis
Garbage collection optimized

Security & Privacy

Security-First Design

Built with security in mind:

✅ Pattern scanning happens locally
✅ No telemetry or tracking
✅ Open source and auditable
✅ Respects .gitignore automatically
✅ MIT licensed

KafkaCode is designed to help developers identify privacy and security issues in their code.

Extensibility

The architecture supports customization:

// Custom pattern scanner
class CustomPatternScanner extends PatternScanner {
  constructor() {
    super();
    this.patterns.customSecret = /MY_CUSTOM_PATTERN/g;
  }
}

// Custom analysis engine
const engine = new AnalysisEngine();
engine.patternScanner = new CustomPatternScanner();

Next Steps

Detection Methods

Learn about detection patterns

Privacy Grading

Understand the grading system

API Reference

Explore the programmatic API

Custom Patterns

Add your own detection patterns
