Anthropic just launched two new AI models, Claude Opus 4 and Claude Sonnet 4 (a drop-in replacement for Claude 3.7 Sonnet), which hit the market on May 22.
Both of these models post similar SWE-bench scores, so in this blog, we will mainly focus on Claude Opus 4. ✌️
Now that this new model, Claude Opus 4, is launched, let's see if we have something cool or just another regular AI model. 👀
TL;DR
If you want to jump straight to the conclusion: when Claude Opus 4 is compared against the other two models, Gemini 2.5 Pro and OpenAI o3, Opus simply dominates in coding, and by a good margin, as you can see for yourself in the comparison below.
If you are looking for a good AI coding assistant, maybe for your editor or in general, Claude Opus 4 is the best option for you (at least for now!)
Brief on Claude Opus 4
If you are on this blog, it's likely for the Claude Opus 4 model, so let me give you a brief introduction to this model before we move any further.
It hasn't even been a week since this model was launched, and they claim it to be the best AI model for coding. Not just that, but an AI model that could autonomously work for a full corporate day (seven hours). Looking scary already!! 😬
It has a 200K token context window (not the number you might expect, but it is what it is), and it's said to be the best model for coding; we'll put that claim to the test in just a moment.
Claude Opus 4 leads on the SWE-bench with a score of 72.5% and can reach up to 79.4% with parallel test-time compute.
As you can see, that's already an improvement of more than 10 percentage points over Anthropic's previous model, Claude 3.7 Sonnet.
The Claude 4 lineup is also 65% less likely to use hacks and shortcuts to get the job done.
Now, imagine an AI model (in this case Claude Opus 4) doing PRs, making commits, and doing everything you can think of all on its own with just a few prompts. How cool would that be, right?
Here's exactly that. The Claude team has shared this quick GitHub Actions integration with Claude Opus 4, where you can see the model making changes on the PR and addressing feedback in real-time. 👇
Doesn't this look a bit dangerous to you? It's striking how much control these AI models have taken on in just the 2-3 years from GPT-3.5 to this model.
This is getting really insane, and I'm not sure if I love or hate this happening. 🥴
Coding Comparison
As you might have already guessed, in this section we will be comparing Claude Opus 4 (SWE-bench 72.5%) vs. Gemini 2.5 Pro (SWE-bench 63.2%) vs. OpenAI o3 (SWE-bench 69.1%) on coding.
💁 All three of these models are coding beasts, so we won't be testing them with any easy questions. We'll use really tough ones and see how they perform head-on.
One thing I will also account for is taste.
1. Particles Morph
Prompt: You can find the prompt I've used here: Link
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
This looks crazy good, and the fact that it was able to do this in one shot after thinking for about 100 seconds (a little over a minute and a half) is even crazier to me. The particle morphing behavior is exactly what I expected: instead of collapsing to a single point before reforming, the particles morph into the next shape directly from the shape they're already in.
There is room for improvement, like the shapes aren't really 100% correct, but the overall implementation is rock solid!
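To make that behavior concrete, here's a minimal sketch of what "morph from the current shape" boils down to. The names and structure are my own illustration, not the model's generated code:

```js
// Minimal sketch (illustrative names, not Opus 4's actual output).
// Each particle eases from wherever it currently is toward its slot in the
// next shape, so the transition starts from the current shape rather than
// collapsing to a point or a default sphere first.
function morphStep(particles, targetShape, speed = 0.05) {
  particles.forEach((p, i) => {
    const t = targetShape[i]; // target {x, y, z} for this particle
    p.x += (t.x - p.x) * speed;
    p.y += (t.y - p.y) * speed;
    p.z += (t.z - p.z) * speed;
  });
}
```

Calling something like this every animation frame and simply swapping `targetShape` retargets the particles mid-flight, which is exactly the effect described above.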
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
This is not bad, but it's definitely not the level of quality I'd expect from Gemini 2.5 Pro. The shapes look poor and don't really meet my expectations. Is that how the bird looks? Seriously? The overall UI is also not up to par.
This is definitely not what I was expecting and somewhat disappointing from this model, but we're comparing it (SWE-bench 63.2%) to Claude Opus 4 (SWE-bench 72.5%), and maybe that's the reason.
🫤 I've noticed that after every new model is launched, the previous best model seems to fade in comparison to the new one. How fast the AI models are improving is just crazy.
Response from OpenAI o3
You can find the code it generated here: Link
Here’s the output of the program:
The response we got from o3 is even worse than the one from Gemini 2.5 Pro. Honestly, I was expecting a bit more from this model, yet here we are.
I'm not sure if you noticed, but the particles don't morph directly from their current shape; instead, they first default to a spherical shape and then morph to the requested shape.
2. 2D Mario Game
Prompt: You can find the prompt I've used here: Link
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
It did it in a matter of seconds. Implementing a whole 2D Mario game, which is genuinely hard, that quickly is a pretty impressive feat.
And not just that, look at how beautiful the UI and the overall vibe are. This could actually serve as a solid start for someone who's trying to build a 2D Mario game in vanilla JS.
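For anyone wondering what a vanilla JS game like this is built around structurally, it's almost always a `requestAnimationFrame` loop. Here's a minimal sketch (my own illustration, not the generated code; `update` and `draw` stand in for the game-specific logic):

```js
// Minimal vanilla JS game loop sketch (illustrative, not the generated code).
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

function update(dt) { /* move the player, apply gravity, resolve collisions */ }
function draw(context) { /* redraw the level, enemies, and player sprite */ }

let lastTime = 0;
function loop(time) {
  const dt = (time - lastTime) / 1000; // seconds since the previous frame
  lastTime = time;
  update(dt);
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  draw(ctx);
  requestAnimationFrame(loop);
}
requestAnimationFrame(loop);
```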
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
It is functional, I must say that, and it's somewhat good. But it's a bit too minimal and also a bit buggy.
Look at the timer in the top right: it's just not working correctly (I'm not that familiar with this game, so maybe that's how it's meant to work), but either way, this doesn't feel like a good output from a model considered this strong.
Response from OpenAI o3
You can find the code it generated here: Link
Here’s the output of the program:
o3 didn't do well on this question. As you can see, it looks like a prototype rather than a working game. It's complete nonsense, and there's no real Mario game here: it has lots and lots of bugs, and there's no way for the game to end.
A disappointing result from this model once again! 👎
3. Tetris Game
Prompt: You can find the prompt I've used here: Link
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
As you can see, we got a perfectly implemented Tetris game in vanilla HTML/CSS/JS in no time; it was so fast I even forgot to keep track of how long it took.
It implemented everything I asked for, including optional features like the ghost piece and high score persistence in localStorage. You can't hear it here, but it also added background theme music, along with a preview of the next three upcoming pieces.
Tell me, for real, how long would this take you if you were to code this all alone, with no AI models?
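On that high score persistence, for reference, here's roughly how it's typically done in vanilla JS with localStorage. The key name and helpers are my own illustration, not the generated code:

```js
// Illustrative high score persistence sketch (key name is an assumption).
const HIGH_SCORE_KEY = 'tetris-high-score';

function loadHighScore() {
  // localStorage returns strings (or null), so coerce to a number.
  return Number(localStorage.getItem(HIGH_SCORE_KEY)) || 0;
}

function saveHighScore(score) {
  // Only persist if the new score beats the stored one.
  if (score > loadHighScore()) {
    localStorage.setItem(HIGH_SCORE_KEY, String(score));
  }
}
```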
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
This one is equally good and works just as well as the Claude Opus 4 version; even the UI looks nice. I love that it could come up with such a clean solution to this problem.
Response from OpenAI o3
You can find the code it generated here: Link
Here’s the output of the program:
This one's interesting. Everything from the tetrominoes falling to the rest of the mechanics seems to work fine, but there's no way for the game to end. Once the tetrominoes hit the top, the game is supposed to end, but it doesn't, and it simply gets stuck forever.
Now, this would likely be an easy fix with a follow-up prompt, but since this is a fairly simple task, I decided to keep it to one shot. Not that big of an issue, but still.
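For reference, the missing piece is usually just a spawn-time collision check. Here's a rough sketch of the kind of fix a follow-up prompt would likely produce (purely illustrative, not o3's code):

```js
// Illustrative game-over check for a grid-based Tetris (not o3's actual code).
// `board` is a 2D array where 1 marks a locked cell; `piece.cells` lists the
// {row, col} squares the newly spawned piece occupies.
function collides(board, piece) {
  return piece.cells.some(({ row, col }) => board[row] && board[row][col] === 1);
}

function spawnPiece(board, piece) {
  // If the new piece already overlaps locked blocks, the stack has hit the top.
  if (collides(board, piece)) {
    return { gameOver: true };
  }
  return { gameOver: false, piece };
}
```

The game loop would then stop updating once `gameOver` is true instead of getting stuck forever.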
4. Chess Game
Prompt: You can find the prompt I've used here: Link
Response from Claude Opus 4
You can find the code it generated here: Link
Here’s the output of the program:
Now, this is out of this world. It implemented an entire chess game from scratch with no libraries. I thought it would use something like Chess.js or another external library, but there you have it: a fully working chess game, even though it misses a few special moves like en passant.
Aside from those special moves, all the moves are tracked perfectly in the move log. This is pure insanity!
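For anyone unsure what that missing rule actually involves: en passant is a one-move window where a pawn can capture an enemy pawn that just advanced two squares past it. Here's a rough sketch of the check (illustrative names, not the generated code):

```js
// Illustrative en passant check (not the generated code). Row 0 is the top rank.
// `lastMove` is the opponent's previous move; `pawn` is our pawn considering the capture.
function enPassantTarget(lastMove, pawn) {
  if (!lastMove || lastMove.piece !== 'pawn') return null;
  // The enemy pawn must have just advanced two squares...
  if (Math.abs(lastMove.toRow - lastMove.fromRow) !== 2) return null;
  // ...and landed directly beside our pawn on the same rank.
  if (lastMove.toRow !== pawn.row || Math.abs(lastMove.toCol - pawn.col) !== 1) return null;
  // The capture lands on the square the enemy pawn skipped over.
  const dir = pawn.color === 'white' ? -1 : 1;
  return { row: pawn.row + dir, col: lastMove.toCol };
}
```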
Response from Gemini 2.5 Pro
You can find the code it generated here: Link
Here’s the output of the program:
Gemini 2.5 Pro also decided to implement everything from scratch, and it even attempted the special moves like en passant, not just the basic piece moves.
The game overall seemed fine, but the soul of chess is missing: the pieces are just there; they don't move. It felt like a small issue that a follow-up prompt could easily fix, but it did not.
You can find its updated code from the follow-up prompt here: Link
Response from OpenAI o3
You can find the code it generated here: Link
Here’s the output of the program:
OpenAI o3 took a more pragmatic approach and decided to use Chess.js, which is what I'd prefer if I were building a production-level chess game, but the implementation didn't hold up.
It seems the external Chess.js import failed, so the code tries to use the Chess object, which is undefined.
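For context, here's what minimal chess.js usage looks like when the library actually loads (a sketch based on the library's documented API, not o3's generated code):

```js
// Minimal chess.js usage sketch. If the library never loads, `Chess` is
// undefined and calls like these throw, which matches the failure above.
import { Chess } from 'chess.js';

const game = new Chess();   // fresh game in the starting position
console.log(game.moves());  // legal moves for the side to move
game.move('e4');            // play a move in standard algebraic notation
console.log(game.fen());    // current position as a FEN string
```

In a plain HTML/CSS/JS page, that import has to be replaced with a working script tag or a bundler; if that script never loads, you end up with exactly the undefined Chess object seen here.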
Conclusion
Did we get a clear winner here? Yes, absolutely, and it's Claude Opus 4.
Amazon-backed Anthropic is doing some real magic with these Claude models: first my earlier favorite, Claude 3.7 Sonnet, and now these two beasts (Claude Sonnet 4 and Claude Opus 4).
Claude Opus 4 is clearly better than the other two models, even though it has a much smaller context window than both of them. Being this much better at coding with such a small context window is by far the most impressive thing I've seen recently in this AI boom.
What do you think, and which one would you pick for yourself? Let me know in the comments below!
Top comments (34)
Coding comparisons are nice, how much $$$ did they cost?
I also think it can distort the findings. Is Opus available for free? If not, why is it not compared to o4-mini-high, for example?
No, Claude Opus 4 is not a free model. You can use Sonnet 4 for free with limited prompts. I picked models that are pretty close on the benchmarks; there was nothing more specific behind the picks.
I could have included o4-mini-high, but it's a somewhat lower-performing model for coding, especially compared with the best coding model here, Claude Opus 4.
I've used o4-mini in one of my comparison blogs; you can check it out here.
Not much, though. The other two models are pretty cheap; it's just Claude Opus 4 that has a slightly higher price per million input/output tokens.
$75 is only slightly higher than $15???
I do not understand. Is this really the only use case, everybody trying to create games or 3D visualizations with AI? Why don't you pick real, practical examples where commercial development could be demonstrated?
It's pretty tough to come up with an exact real-world practical example to use for this kind of testing. If I had one, I'd definitely use it.
Pretty insane how fast these models level up - I gotta admit, seeing AI spit out better code than me kinda stings but also fires me up to keep learning.
It's the same situation for many devs nowadays :)
true
true
🙌
Don't tell me this was built in one shot. 🤯 Are we cooked then? How is it building such a thing in pure HTML/CSS/JS? I don't think there's much data like this that they could have been trained on.
Yes, it did it in one shot and I'm not kidding.
What about testing this on hard LeetCode questions? I tried Sonnet 4 on one of the hard LeetCode questions and it got Time Limit Exceeded. Clearly, the AIs are heavily trained on web dev, but when it comes to general coding, I'd say Gemini is better.
For this test, I decided to focus entirely on building stuff rather than algo/LeetCode questions. You can check out some of my earlier comparisons where I've tested most of these models on LeetCode and Codeforces questions as well.
Still just toy factories.
If let loose in the wild by some greedy CEO who wants to save money on thorough review, the code these models produce will end up killing people en masse.
Don't forget, AI is just getting started, and there are still going to be lots and lots of improvements in the coming years. The models already launched are also going to get tons of upgrades moving forward.
Maybe, maybe not.
Super comprehensive breakdown - it's wild to see how fast these coding models are leveling up! Have you tried plugging Claude Opus 4 into your actual dev workflow yet, or just for these benchmarks?
Yeah, the real test comes with the actual dev workflow. I have yet to use it properly in my dev workflow, but so far it's been doing fine.
Companies launching new AI models to capture the tech space, the way Microsoft owns the entire dev ecosystem, is a never-ending rat race that will go on for years.
You said it correctly: right now it seems like a race between Google, Anthropic, OpenAI, and a few others. All these new models launching every few weeks with such improvements are really out of this world to me. At least we're getting something better with every single release.
I don't need any of this.
Folks, let me know your thoughts on this model, Claude Opus 4, in the comments. This one is wild, and do you see it being your go-to model when it comes to coding? 👀
Claude Opus 4 represents a significant advancement in AI language models, particularly because of its emphasis on coding, long-term task management, and safety. Unlike traditional models, Opus 4 appears to focus heavily on supporting complex workflows and reinforcing ethical guidelines, making it suitable for professional and sensitive applications.
Thank you for this info. Have you tried using Opus 4 in your workflow?