Hacker News story: Opus 4.7 vs. 4.6 after 3 days of real coding side by side from my actual session

Opus 4.7 vs. 4.6 after 3 days of real coding side by side from my actual session
I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side. still pretty early for 4.7, but a few things surprised me. In my sessions, 4.7 gets things right on the first try less often than 4.6. One-shot rate sits around 74.5% vs 83.8%, and I am seeing roughly double the retries per edit (0.46 vs 0.22). It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. cost per call is $0.185 vs $0.112. when I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side so I wouldnt read much into it yet. 4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample. A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do. npx codeburn compare 0 comments on Hacker News.
I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side. still pretty early for 4.7, but a few things surprised me. In my sessions, 4.7 gets things right on the first try less often than 4.6. One-shot rate sits around 74.5% vs 83.8%, and I am seeing roughly double the retries per edit (0.46 vs 0.22). It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. cost per call is $0.185 vs $0.112. when I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side so I wouldnt read much into it yet. 4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample. A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do. npx codeburn compare

US Economy News

Hacker News story: Opus 4.7 vs. 4.6 after 3 days of real coding side by side from my actual session

No comments:

Follow Us

Recent Posts

Popular Posts

Search This Blog

Random Posts

Tags

Recent Posts