Claude Opus 4.5 and why evaluating new LLMs is increasingly difficult