Dupehound: Offline and Deterministic Code Duplicate Detector for Agentic Codebases
Dupehound is a fast, local command-line tool that fingerprints abstract syntax tree (AST) structure to catch duplicate functions written by AI agents. Wiring it into continuous integration pipelines, or feeding its output back to large language models, helps developers prevent code duplication and context bloat.
Impact: High
Why it matters
It solves the agent-induced code-bloat problem locally and deterministically without wasting API tokens or relying on heavy machine learning models.
TL;DR
- Solves AI agent code-bloat by structurally fingerprinting codebases using tree-sitter ASTs.
- Runs entirely offline and deterministically, scanning millions of lines in seconds (3.6s for VS Code).
- Integrates into CI via pre-commit hooks or GitHub Actions to block duplicate PRs.
- Feeds structural warnings directly to coding agents via CLAUDE.md to enforce code reuse.
Key facts
- Supported languages: TypeScript, TSX, JavaScript, Python, Rust, Go, Java, Ruby, Swift
- Scan speed (VS Code, 2.97M lines): 3.6s on a standard laptop
- Minimum token threshold: 40 normalized tokens per function
- Exit codes for CI check: 0 clean, 1 findings, 2 error
AST-Based Structural Fingerprinting
Unlike text-based search, dupehound works on the underlying abstract syntax tree: it drops comments and replaces identifiers, strings, and numbers with sentinels, so only the structure of the code is compared. It hashes k-grams of 10 tokens with rolling hashes and applies robust winnowing, which guarantees that any shared sequence of 17 normalized tokens is caught. Similarity is computed as an exact Jaccard index, and matching functions are grouped into clusters with union-find.
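The pipeline described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not dupehound's implementation: Python's `tokenize` module stands in for tree-sitter, `zlib.crc32` stands in for a rolling hash, the window of 8 k-grams is inferred from the usual winnowing relation (8 + 10 - 1 = 17), and the 0.8 similarity threshold and all function names are hypothetical.

```python
import io
import keyword
import tokenize
import zlib

K = 10       # k-gram length, as stated in the text
WINDOW = 8   # assumed window size: WINDOW + K - 1 = 17 tokens


def normalize(source):
    """Drop comments and layout; replace identifiers, strings, numbers with sentinels."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        else:
            out.append(tok.string)  # keywords, operators, punctuation
    return out


def fingerprint(tokens):
    """Hash every k-gram, then keep the minimum hash of each window (winnowing)."""
    hashes = [zlib.crc32(" ".join(tokens[i:i + K]).encode())
              for i in range(len(tokens) - K + 1)]
    if not hashes:
        return set()
    # Inputs shorter than one window are treated as a single window.
    last_start = max(1, len(hashes) - WINDOW + 1)
    return {min(hashes[i:i + WINDOW]) for i in range(last_start)}


def jaccard(a, b):
    """Exact Jaccard index of two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(fingerprints, threshold=0.8):
    """Group names whose fingerprints are similar, using union-find."""
    parent = {name: name for name in fingerprints}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(fingerprints)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) > 1]
```

Two functions that differ only in names, literals, and comments normalize to the same token stream, so their fingerprint sets are identical and their Jaccard index is 1.0. Because the sketch stores hash values rather than positions, the tie-breaking rule that distinguishes robust winnowing does not affect the result here.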
CLI Commands and Integration
Dupehound provides three core commands:
- `dupehound scan [path]` scans a directory, ranks duplicate clusters by deletable lines, and outputs a 'slop score' representing the percentage of redundant code.
- `dupehound history` reads git blobs directly from the object database without checking out files, mapping out exactly when duplication spiked over time.
- `dupehound check` operates as a CI gate or pre-commit hook. It indexes the codebase at the base git revision and analyzes only the newly added or modified functions, exiting with code 1 upon discovering duplicates.
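As a rough illustration of the CI gate, the sketch below wraps `dupehound check` and maps the exit codes listed in the key facts (0 clean, 1 findings, 2 error) to a status. The wrapper, its function names, and the choice to capture both stdout and stderr are assumptions made for illustration; the source does not say where the tool writes its report, and no flags are passed because none are documented here.

```python
import subprocess
import sys

# Exit-code meanings as listed in the key facts.
MEANINGS = {0: "clean", 1: "findings", 2: "error"}


def run_gate(cmd=("dupehound", "check")):
    """Run the check command and return (status, captured output)."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
    except FileNotFoundError:
        return "error", "command not found: " + cmd[0]
    # Any unexpected exit code is treated as an error.
    status = MEANINGS.get(proc.returncode, "error")
    return status, proc.stdout + proc.stderr


def main():
    """Hypothetical hook entry point: print the report, then fail on findings."""
    status, report = run_gate()
    if report:
        print(report, end="")
    if status == "findings":
        print("Duplicate functions found; reuse the originals instead.",
              file=sys.stderr)
    return {"clean": 0, "findings": 1}.get(status, 2)
```

A pre-commit hook or CI step could call `sys.exit(main())`, so a non-zero status blocks the commit or merge.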
Prompting Agents for Code Reuse
To prevent agents from generating duplicate code, developers can pipe the output of `dupehound check` directly to their agent. Placing that output, or guidelines derived from it, in a `CLAUDE.md` or `AGENTS.md` file pushes the agent to inspect the existing original function and refactor its code to reuse it rather than committing new redundant blocks.
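One way to keep an instruction file current is to rewrite a marked section with the latest check output on every run. The sketch below assumes the report is opaque text; the HTML-comment markers, the section heading, and the `update_instructions` helper are hypothetical and not part of dupehound.

```python
from pathlib import Path

# Hypothetical markers delimiting the section this script owns.
BEGIN = "<!-- dupehound:begin -->"
END = "<!-- dupehound:end -->"


def update_instructions(path, report):
    """Replace (or append) the marked section with the latest check output."""
    target = Path(path)
    text = target.read_text() if target.exists() else ""
    section = (
        f"{BEGIN}\n"
        "## Duplicate-code findings\n"
        "Before adding a function, reuse the originals listed here:\n\n"
        f"{report.strip()}\n"
        f"{END}\n"
    )
    if BEGIN in text and END in text:
        # Keep everything outside the markers untouched.
        head, _, rest = text.partition(BEGIN)
        _, _, tail = rest.partition(END)
        new_text = head + section + tail.lstrip("\n")
    else:
        separator = "\n" if text and not text.endswith("\n") else ""
        new_text = text + separator + section
    target.write_text(new_text)
    return new_text
```

Calling `update_instructions("AGENTS.md", report)` after each check keeps a single findings section in the file instead of appending a new one every time.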
Try it in 2 minutes
brew install rafaelpta/dupehound/dupehound
dupehound scan .
dupehound check
When to use
- When working heavily with agentic integrated development environments like Claude Code, Cursor, or specialized code generation agents.
- When you need a deterministic, reproducible merge gate for CI without network or API key dependencies.
- When refactoring a large legacy codebase written in one of the supported languages.
What to do today
- Install dupehound locally via Homebrew: `brew install rafaelpta/dupehound/dupehound`
- Run `dupehound scan .` on your active project to check your current 'slop score'.
- Configure a pre-commit hook or GitHub Action using `dupehound check` to fail on duplicate logic.
- Update your `CLAUDE.md` or `AGENTS.md` instruction files so the agent reads `dupehound check` output and reuses existing functions.