New research reveals that even frontier LLMs suffer 'catastrophic degradation' when long-horizon document editing is delegated to them. Despite their power, the models struggle with context rot and document corruption, indicating that raw model intelligence alone is insufficient for reliable autonomous workflows.
Topics: AI agents, Delegate 52 benchmark, LLM reliability, harness engineering, document editing