Overview
KafkaCode uses a dual-layer analysis approach that combines pattern-based detection with AI-powered contextual analysis to identify privacy and compliance issues in your source code.
Architecture
File Discovery
The FileScanner component recursively scans your project directory, respecting .gitignore rules and excluding common directories like node_modules, .git, and build.
Supported File Types:
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- Go (.go)
- Ruby (.rb)
- PHP (.php)
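The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the actual FileScanner: the names `discover_files`, `SUPPORTED_EXTENSIONS`, and `EXCLUDED_DIRS` are hypothetical, the values are copied from the lists in this section, and .gitignore handling is left out.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumed values, copied from the lists in this section.
SUPPORTED_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".rb", ".php"}
EXCLUDED_DIRS = {"node_modules", ".git", "build"}


def discover_files(root: str) -> Iterator[Path]:
    """Yield supported source files under root, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix in SUPPORTED_EXTENSIONS:
                yield path
```

Pruning `dirnames` in place means large directories such as node_modules are never listed at all, instead of being walked and filtered afterwards.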
Pattern-Based Analysis
The PatternScanner performs regex-based detection for known privacy issues:
- Hardcoded secrets (AWS keys, API tokens)
- PII patterns (emails, phone numbers)
- Sensitive keywords in assignment contexts
- High entropy strings
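To make these detection categories concrete, here is a hedged sketch of a regex-plus-entropy line scanner. The patterns, the 20-character minimum, and the 4.0 entropy cutoff are illustrative assumptions and not KafkaCode's actual rule set; phone-number detection is omitted for brevity.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (assumed, not the tool's real rules).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "sensitive_assignment": re.compile(
        r"(?i)\b(password|secret|api_key|token)\b\s*=\s*['\"][^'\"]+['\"]"
    ),
}
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")  # candidate secret-like tokens
ENTROPY_THRESHOLD = 4.0  # assumed cutoff, in bits per character


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def scan_line(line: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for one line of source code."""
    findings = [
        (kind, match.group(0))
        for kind, regex in PATTERNS.items()
        for match in regex.finditer(line)
    ]
    for match in TOKEN_RE.finditer(line):
        if shannon_entropy(match.group(0)) >= ENTROPY_THRESHOLD:
            findings.append(("high_entropy_string", match.group(0)))
    return findings
```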
AI-Powered Analysis
The LLMAnalyzer uses a large language model (LLM) to:
- Understand code context
- Identify subtle privacy issues
- Reduce false positives
- Provide contextual recommendations
Dual-Layer Detection
Layer 1: Pattern-Based Detection
Fast, deterministic scanning using regex patterns:
- ⚡ Very fast execution
- 🎯 High precision for known patterns
- 📊 Zero external dependencies
Limitations:
- May miss context-dependent issues
- Can produce false positives
- Limited to predefined patterns
Layer 2: AI-Powered Analysis
Contextual analysis using an LLM:
- 🧠 Understands code context
- 🔍 Finds subtle privacy issues
- ✨ Reduces false positives
- 💡 Provides smart recommendations
How it works:
- Takes the code snippet and the pattern findings
- Analyzes semantic meaning
- Identifies privacy concerns in context
- Suggests specific remediation steps
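One way the first of these steps could look is sketched below: the code snippet and the Layer 1 findings are combined into a single prompt. The `PatternFinding` type, the `build_prompt` function, and the prompt wording are all hypothetical; the prompt KafkaCode actually sends is not documented here, and no model is called.

```python
from dataclasses import dataclass


@dataclass
class PatternFinding:
    """Hypothetical record of one Layer 1 (pattern-based) finding."""
    line: int
    kind: str


def build_prompt(snippet: str, findings: list[PatternFinding]) -> str:
    """Combine a code snippet and its pattern findings into one prompt string."""
    finding_lines = [f"- line {f.line}: {f.kind}" for f in findings]
    return "\n".join([
        "Review this code for privacy and compliance issues.",
        "Pattern findings:",
        *(finding_lines or ["- none"]),
        "Code:",
        snippet,
        "For each finding, say whether it is a real issue and suggest a fix.",
    ])
```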
Severity Classification
Issues are classified into four severity levels:
Critical
Score: 100 points
- Exposed API keys
- Private keys
- Database credentials
- OAuth tokens
High
Score: 50 points
- Sensitive data in code
- Payment information
- Authentication secrets
- Personal identifiers
Medium
Score: 10 points
- Email addresses
- Phone numbers
- High entropy strings
- Potential secrets
Low
Score: 1 point
- IP addresses
- URLs
- Configuration values
- Minor issues
Privacy Grading Algorithm
The privacy grade is calculated from the total severity score. For example:
- 1 Critical issue (100 pts) + 2 Medium issues (20 pts) = 120 points = Grade C-
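The score arithmetic in this example can be reproduced with a few lines of Python. The point values come from the severity levels listed above; the names `SEVERITY_POINTS` and `total_score` are hypothetical, and the mapping from score to letter grade is not shown because the thresholds are not given in this section.

```python
# Point values from the Severity Classification section.
SEVERITY_POINTS = {"critical": 100, "high": 50, "medium": 10, "low": 1}


def total_score(severities: list[str]) -> int:
    """Sum the points for a list of issue severities."""
    return sum(SEVERITY_POINTS[s] for s in severities)


# 1 Critical issue + 2 Medium issues, as in the example above.
print(total_score(["critical", "medium", "medium"]))  # prints 120
```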
File Scanning Process
1. Directory Traversal
2. Ignore Rules
KafkaCode automatically respects:
Built-in Ignores:
- .git/
- node_modules/
- venv/, .venv/, env/
- build/, dist/, target/
- .next/, .nuxt/
- coverage/, __pycache__/
.gitignore rules:
- Reads and applies .gitignore rules
- Uses glob pattern matching
- Respects both file and directory patterns
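A simplified sketch of glob-based ignore matching follows. It approximates only a subset of .gitignore semantics (no negation with `!`, no anchored or `**` patterns), and the function names are assumptions made for illustration. `rel_path` is expected to be a file path relative to the project root, using forward slashes.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Built-in ignores as listed above.
BUILTIN_IGNORES = [
    ".git/", "node_modules/", "venv/", ".venv/", "env/", "build/",
    "dist/", "target/", ".next/", ".nuxt/", "coverage/", "__pycache__/",
]


def parse_ignore_lines(text: str) -> list[str]:
    """Return the patterns from .gitignore-style text, dropping blanks and comments."""
    stripped = (raw.strip() for raw in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Check a relative file path against file and directory glob patterns."""
    parts = PurePosixPath(rel_path).parts
    if not parts:
        return False
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: compare against each parent directory name.
            name = pattern.rstrip("/")
            if any(fnmatchcase(part, name) for part in parts[:-1]):
                return True
        elif fnmatchcase(parts[-1], pattern) or fnmatchcase(rel_path, pattern):
            # File pattern: compare against the file name or the whole path.
            return True
    return False
```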
3. Content Analysis
Each file is:
- Read into memory
- Scanned by PatternScanner
- Analyzed by LLMAnalyzer (if patterns found)
- Findings aggregated and classified
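The four steps above might be wired together as in the sketch below. The scanner and analyzer are passed in as plain callables because their real interfaces are not documented here. The `analyze_file` name, the dict-based finding shape, and sorting by severity as a stand-in for "classified" are all assumptions.

```python
from typing import Callable

SEVERITY_ORDER = ["critical", "high", "medium", "low"]
Finding = dict  # assumed shape, e.g. {"kind": "email", "severity": "medium"}


def analyze_file(
    path: str,
    pattern_scan: Callable[[str], list[Finding]],
    llm_analyze: Callable[[str, list[Finding]], list[Finding]],
) -> list[Finding]:
    """Read one file, run both layers, and return findings sorted by severity."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        content = handle.read()

    pattern_findings = pattern_scan(content)
    if not pattern_findings:
        # The LLM layer runs only when the pattern layer found something.
        return []

    combined = pattern_findings + llm_analyze(content, pattern_findings)
    return sorted(combined, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
```

Gating the second layer on pattern findings keeps model calls proportional to the number of suspicious files instead of the size of the project.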
Performance Optimizations
Efficient File Scanning
- Skips ignored directories early
- Uses streaming for large files
- Processes files sequentially to manage memory
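As an illustration of streaming, a scanner can iterate over a file line by line so that only one line is held in memory at a time. This is a sketch under assumptions: `scan_stream` and the email pattern are hypothetical, and the size at which KafkaCode switches to streaming is not stated in this section.

```python
import re
from typing import Iterator

# Illustrative pattern, compiled once.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_stream(path: str) -> Iterator[tuple[int, str]]:
    """Lazily yield (line_number, match) pairs without loading the whole file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for lineno, line in enumerate(handle, start=1):
            for match in EMAIL_RE.finditer(line):
                yield lineno, match.group(0)
```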
Smart LLM Usage
- Only analyzes files with pattern findings
- Batches related issues together
- Uses concise prompts to minimize tokens
Caching Strategy
- Pattern regex compiled once
- File stats cached during traversal
- Reuses analysis for unchanged files
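A minimal sketch of two of these ideas: a regex compiled once at import time, and analysis results reused when a file's content has not changed. Keying the cache on a SHA-256 hash of the content is an assumption; how KafkaCode actually detects unchanged files is not described here, and `AnalysisCache` and `find_emails` are hypothetical names.

```python
import hashlib
import re
from typing import Callable

# Compiled once at import time and reused for every file.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(content: str) -> list[str]:
    """Example analyzer: return every email-like string in content."""
    return EMAIL_RE.findall(content)


class AnalysisCache:
    """Reuse analysis results for content that has already been seen."""

    def __init__(self) -> None:
        self._results: dict[str, list[str]] = {}

    def analyze(self, content: str, analyzer: Callable[[str], list[str]]) -> list[str]:
        # Identical content hashes to the same key, so unchanged files hit the cache.
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = analyzer(content)
        return self._results[key]
```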
Memory Management
- Files processed one at a time
- Buffers cleared after analysis
- Garbage collection optimized
Security & Privacy
Security-First Design
Built with security in mind:
- ✅ Pattern scanning happens locally
- ✅ No telemetry or tracking
- ✅ Open source and auditable
- ✅ Respects .gitignore automatically
- ✅ MIT licensed
KafkaCode is designed to help developers identify privacy and security issues in their code.

