Overview
KafkaCode uses a dual-layer analysis approach combining pattern-based detection with AI-powered contextual analysis to identify privacy and compliance issues in your source code.
Architecture
1. File Discovery
The FileScanner component recursively scans your project directory, respecting .gitignore rules and excluding common directories like node_modules, .git, and build.
Supported File Types:
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- Go (.go)
- Ruby (.rb)
- PHP (.php)
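The discovery step can be sketched as follows. This is a minimal illustration, not KafkaCode's actual FileScanner: the names `SUPPORTED_EXTENSIONS`, `EXCLUDED_DIRS`, and `discover_files` are hypothetical, and .gitignore handling is left out for brevity.

```python
import os
from pathlib import Path

# Assumed constants: the values mirror the lists in this section,
# but the names are hypothetical.
SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".rb", ".php"}
EXCLUDED_DIRS = {"node_modules", ".git", "build"}


def discover_files(root):
    """Return supported source files under root, skipping excluded directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            if Path(name).suffix in SUPPORTED_EXTENSIONS:
                found.append(Path(dirpath) / name)
    return sorted(found)
```

Pruning `dirnames` in place means a large node_modules tree is never traversed at all, rather than walked and filtered afterwards.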
2. Pattern-Based Analysis
The PatternScanner performs regex-based detection for known privacy issues:
- Hardcoded secrets (AWS keys, API tokens)
- PII patterns (emails, phone numbers)
- Sensitive keywords in assignment contexts
- High entropy strings
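A rough sketch of what this kind of regex and entropy scanning can look like. The patterns, the 4.0 bits-per-character threshold, and the function names are illustrative assumptions. The real PatternScanner rule set is not shown in this section.

```python
import math
import re
from collections import Counter

# Illustrative rules only (assumed, not KafkaCode's actual patterns).
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "sensitive_assignment": re.compile(
        r"(?i)\b(password|secret|api_key|token)\b\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def shannon_entropy(text):
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan_line(line, line_no, entropy_threshold=4.0):
    """Return (line_no, rule_name) pairs for every rule that matches the line."""
    findings = [(line_no, name) for name, rx in PATTERNS.items() if rx.search(line)]
    # Flag long quoted literals whose character distribution looks random.
    for literal in re.findall(r"['\"]([^'\"]{20,})['\"]", line):
        if shannon_entropy(literal) >= entropy_threshold:
            findings.append((line_no, "high_entropy_string"))
    return findings
```

Compiling the patterns once at module level keeps the per-line cost to a handful of `search` calls.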
3. AI-Powered Analysis
The LLMAnalyzer uses advanced language models to:
- Understand code context
- Identify subtle privacy issues
- Reduce false positives
- Provide contextual recommendations
4. Report Generation
The ReportGenerator compiles findings and produces:
- Privacy grade (A+ to F)
- Severity classification
- Line-by-line details
- Actionable recommendations
Dual-Layer Detection
Layer 1: Pattern-Based Detection
Fast, deterministic scanning using regex patterns.
Advantages:
- ⚡ Very fast execution
- 🎯 High precision for known patterns
- 📊 Zero external dependencies
Limitations:
- May miss context-dependent issues
- Can produce false positives
- Limited to predefined patterns
Layer 2: AI-Powered Analysis
Contextual analysis using an LLM.
Advantages:
- 🧠 Understands code context
- 🔍 Finds subtle privacy issues
- ✨ Reduces false positives
- 💡 Provides smart recommendations
How it works:
- Takes the code snippet and pattern findings
- Analyzes semantic meaning
- Identifies privacy concerns in context
- Suggests specific remediation steps
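One way to picture the hand-off from Layer 1 to Layer 2 is a prompt that bundles the snippet with the pattern findings. The sketch below is an assumption for illustration: `build_prompt` and the prompt wording are hypothetical, and the actual model call is omitted because the LLM client is not described in this section.

```python
def build_prompt(snippet, findings):
    """Combine a code snippet and Layer 1 findings into one review prompt.

    findings is a list of (line_number, rule_name) pairs (assumed shape).
    """
    lines = [
        "Review this code for privacy and compliance issues.",
        "Pattern findings to confirm or dismiss:",
    ]
    lines += [f"- line {line_no}: {rule}" for line_no, rule in findings]
    lines += ["Code:", snippet, "For each real issue, suggest a specific fix."]
    return "\n".join(lines)
```

Passing the pattern findings along lets the model confirm or dismiss each one in context, which is how a second layer can reduce false positives.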
Severity Classification
Issues are classified into four severity levels:
Critical
Score: 100 points
- Exposed API keys
- Private keys
- Database credentials
- OAuth tokens
High
Score: 50 points
- Sensitive data in code
- Payment information
- Authentication secrets
- Personal identifiers
Medium
Score: 10 points
- Email addresses
- Phone numbers
- High entropy strings
- Potential secrets
Low
Score: 1 point
- IP addresses
- URLs
- Configuration values
- Minor issues
Privacy Grading Algorithm
The privacy grade is calculated from the total severity score. For example:
- 1 Critical issue (100 pts) + 2 Medium issues (20 pts) = 120 points = Grade C-
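The arithmetic in the example can be checked with a small sketch. The point values come from the severity levels listed above. The names `SEVERITY_POINTS` and `total_score` are hypothetical, and because the score-to-grade thresholds are not listed in this section, the sketch stops at the total score.

```python
# Point values as listed under Severity Classification.
SEVERITY_POINTS = {"critical": 100, "high": 50, "medium": 10, "low": 1}


def total_score(severities):
    """Sum the points for a list of severity labels."""
    return sum(SEVERITY_POINTS[s] for s in severities)


# 1 Critical + 2 Medium, as in the example above.
score = total_score(["critical", "medium", "medium"])
print(score)  # 120
```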
File Scanning Process
1. Directory Traversal
2. Ignore Rules
KafkaCode automatically respects:
Built-in Ignores:
- .git/
- node_modules/
- venv/, .venv/, env/
- build/, dist/, target/
- .next/, .nuxt/
- coverage/, __pycache__/
.gitignore Support:
- Reads and applies .gitignore rules
- Uses glob pattern matching
- Respects both file and directory patterns
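A simplified sketch of glob-based ignore matching, using Python's standard `fnmatch`. It is an approximation under stated assumptions: `is_ignored` and `BUILTIN_IGNORES` are hypothetical names, and full .gitignore semantics (negation with `!`, anchored patterns, `**`) are not handled.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Directory names taken from the built-in ignore list above.
BUILTIN_IGNORES = [
    ".git/", "node_modules/", "venv/", ".venv/", "env/", "build/",
    "dist/", "target/", ".next/", ".nuxt/", "coverage/", "__pycache__/",
]


def is_ignored(rel_path, patterns):
    """Return True if a relative POSIX-style path matches any ignore pattern.

    A pattern ending in "/" matches any directory component of the path.
    Other patterns are globbed against the file name and the whole path.
    """
    parts = PurePosixPath(rel_path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            if any(fnmatch(part, pattern[:-1]) for part in parts[:-1]):
                return True
        elif fnmatch(parts[-1], pattern) or fnmatch(rel_path, pattern):
            return True
    return False
```

For example, `is_ignored("src/debug.log", BUILTIN_IGNORES + ["*.log"])` is true through the file pattern, while `is_ignored("src/env.py", BUILTIN_IGNORES)` is false because `env/` only matches directories.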
3. Content Analysis
Each file is:
- Read into memory
- Scanned by PatternScanner
- Analyzed by LLMAnalyzer (if patterns found)
- Findings aggregated and classified
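The steps above can be sketched as a single function that gates the LLM call on the pattern pass. Everything here is an assumption for illustration: `SECRET_RX` is a one-rule stand-in for PatternScanner, and `llm_analyze` is any callable supplied by the caller, not KafkaCode's actual LLMAnalyzer interface.

```python
import re

# One-rule stand-in for the pattern layer (assumed, for illustration).
SECRET_RX = re.compile(r"(?i)\b(password|secret|token)\b\s*=")


def analyze_file(path, llm_analyze):
    """Scan one file and call llm_analyze only if the pattern pass finds something."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        source = fh.read()
    findings = [
        {"line": line_no, "rule": "sensitive_assignment"}
        for line_no, line in enumerate(source.splitlines(), start=1)
        if SECRET_RX.search(line)
    ]
    if not findings:
        return []  # Clean file: the LLM layer is skipped entirely.
    return llm_analyze(source, findings)
```

Because clean files return early, the slower and more expensive second layer only runs where the first layer has already found something to examine.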
Performance Optimizations
Efficient File Scanning
- Skips ignored directories early
- Uses streaming for large files
- Processes files sequentially to manage memory
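Streaming a large file can be as simple as iterating over it line by line. A minimal sketch, assuming a hypothetical `scan_stream` helper and a caller-supplied compiled regex:

```python
import re


def scan_stream(path, pattern):
    """Yield (line_number, line) for matching lines, reading one line at a time."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        # Iterating the file object keeps only the current line in memory.
        for line_no, line in enumerate(fh, start=1):
            if pattern.search(line):
                yield line_no, line.rstrip("\n")
```

The trade-off is that patterns spanning multiple lines cannot match in this form, so a real scanner would need a different strategy for those rules.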
Smart LLM Usage
- Only analyzes files with pattern findings
- Batches related issues together
- Uses concise prompts to minimize tokens
Caching Strategy
- Pattern regex compiled once
- File stats cached during traversal
- Reuses analysis for unchanged files
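Reusing results for unchanged files is commonly done by keying a cache on a hash of the file content. The sketch below shows that general technique. The `AnalysisCache` class is hypothetical, and how KafkaCode actually detects unchanged files is not stated in this section.

```python
import hashlib


class AnalysisCache:
    """In-memory cache keyed by the SHA-256 digest of file content (bytes)."""

    def __init__(self):
        self._results = {}
        self.misses = 0  # Number of times analyze actually ran.

    def get_or_compute(self, content, analyze):
        """Return the cached result for content, running analyze only on a miss."""
        key = hashlib.sha256(content).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = analyze(content)
        return self._results[key]
```

Keying on content rather than path or modification time means a renamed or re-saved but identical file still hits the cache.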
Memory Management
- Files processed one at a time
- Buffers cleared after analysis
- Garbage collection optimized
Security & Privacy
Security-First Design
Built with security in mind:
- ✅ Pattern scanning happens locally
- ✅ No telemetry or tracking
- ✅ Open source and auditable
- ✅ Respects .gitignore automatically
- ✅ MIT licensed
KafkaCode is designed to help developers identify privacy and security issues in their code.

