Ornith-1.0: Self-Scaffolding Open-Source Models for Agentic Coding Tasks
Deep Reinforce has introduced Ornith-1.0, a family of self-improving open-source models (9B to 397B parameters) designed for agentic coding. By co-evolving task-specific scaffolds with the model's policy, the family achieves competitive performance on coding benchmarks.
Why it matters
Ornith-1.0 moves away from fixed, human-designed harnesses and lets the models develop, on their own, the orchestration logic that complex coding tasks require.
TL;DR
- Self-improving scaffold architecture.
- Reduces reliance on human-designed test harnesses.
- Multi-layered approach to prevent reward hacking.
Self-Improving Scaffold Co-Evolution
Ornith-1.0 uses a training framework in which the scaffold co-evolves with the policy. During RL, the model first proposes a task-specific scaffold and then generates a solution rollout conditioned on that scaffold. The reward optimizes both roles, the orchestrator that proposes the scaffold and the executor that carries out the rollout, so orchestration strategies emerge without being hand-designed.
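A minimal toy sketch of that loop, assuming a drastic simplification: the scaffold and the solution are reduced to discrete choices (the hypothetical `SCAFFOLDS` and `ACTIONS` lists) instead of LLM-generated text, and both roles are trained with plain REINFORCE against a stand-in reward (`toy_reward`). None of these names come from Ornith-1.0; the sketch only illustrates the idea that one policy acts twice per episode and a single reward credits both the scaffold choice and the rollout conditioned on it.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical scaffold templates (orchestrator choices) and executor actions.
SCAFFOLDS = ["plan_then_code", "test_first", "direct_edit"]
ACTIONS = ["small_patch", "full_rewrite"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class ToyPolicy:
    """One policy in two roles: orchestrator (proposes a scaffold) and executor (acts under it)."""
    orch_logits: list = field(default_factory=lambda: [0.0] * len(SCAFFOLDS))
    exec_logits: dict = field(
        default_factory=lambda: {s: [0.0] * len(ACTIONS) for s in SCAFFOLDS}
    )

    def propose_scaffold(self, rng):
        return rng.choices(range(len(SCAFFOLDS)), weights=softmax(self.orch_logits))[0]

    def rollout(self, scaffold, rng):
        # The executor's action distribution is conditioned on the chosen scaffold.
        probs = softmax(self.exec_logits[SCAFFOLDS[scaffold]])
        return rng.choices(range(len(ACTIONS)), weights=probs)[0]


def toy_reward(scaffold, action):
    # Stand-in for a test-based reward: only one scaffold/action pairing "passes".
    passing = ("test_first", "small_patch")
    return 1.0 if (SCAFFOLDS[scaffold], ACTIONS[action]) == passing else 0.0


def reinforce_update(logits, chosen, advantage, lr):
    # Gradient of log softmax(logits)[chosen] w.r.t. logits is onehot(chosen) - probs.
    probs = softmax(logits)
    for i in range(len(logits)):
        logits[i] += lr * advantage * ((1.0 if i == chosen else 0.0) - probs[i])


def train(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    policy = ToyPolicy()
    baseline = 0.0
    for _ in range(steps):
        s = policy.propose_scaffold(rng)  # orchestrator step
        a = policy.rollout(s, rng)        # executor step, conditioned on s
        advantage = toy_reward(s, a) - baseline
        baseline += 0.05 * advantage      # running-mean reward baseline
        # One reward signal updates both roles of the same policy.
        reinforce_update(policy.orch_logits, s, advantage, lr)
        reinforce_update(policy.exec_logits[SCAFFOLDS[s]], a, advantage, lr)
    return policy
```

After `policy = train()`, the orchestrator logits should come to favor the scaffold under which the executor can earn reward, and the executor logits under that scaffold should favor the rewarded action. The design point is that the scaffold is treated as an action of the policy itself rather than a fixed wrapper, so it receives credit from the same reward as the solution.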
Mitigating Reward Hacking
To prevent reward hacking, Ornith-1.0 uses three isolation layers: an immutable outer trust boundary, a deterministic monitor, and a frozen LLM judge that holds veto power.