swe-bench

Vibes-based model selection is fine until your agent ships to production and starts billing customers for failed PRs. Ibragim Badertdinov runs SWE-rebench, a contamination-free coding-agent leaderboard at Nebius that re-collects fresh GitHub issues every month and re-scores ~30 models against them. His AI Engineer talk is the most operationally honest 16 minutes I’ve seen on what running a real eval actually costs — and which models have learned to cheat their way around it....