GPT-5 can beat some of the harder missions

But just how good is it?

The Upshot for GPT-5:
  • The best win rate so far: 100% on easy/medium missions, up to 73% on hard missions
  • So risk-averse that it rarely completes games in the minimum number of tricks, performing worse than GPT-5-Mini on efficiency
  • Its 8 failed missions break down into complex planning failures (5x), miscounted cards (2x), and deliberate suicide (1x)
  • Substantially better than other models at realizing when the game has been irretrievably lost

GPT-5 and its smaller siblings, GPT-5-mini and GPT-5-nano, clearly and consistently outperform the other models.

Mission-by-Mission Performance

Although it succeeds in 100% of cases on the easy and medium missions, GPT-5 almost never succeeds in the minimum number of tricks. This cautiousness has some benefit: unlike more aggressive models such as GPT-4o-mini, GPT-5 never loses on the first trick (save once, when it deliberately failed a mission to restart the game). GPT-5 also has the rare distinction of recognizing game failures better than the other models (see GPT-5's Understanding of Failure). Many games can be won in a single trick, but GPT-5 takes a median of 3 tricks to win them.

If we measure the overall optimal play rate as the share of trials won in the minimum number of tricks, the full GPT-5 model is beaten by GPT-5-Mini, GPT-5-nano and even GPT-4.1-Mini. The smaller models play very aggressively, so even though they win less often, they win faster.
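To be concrete about the metric, here's a minimal sketch of how such a rate could be computed from per-trial records; the field names (won, tricks_taken, optimal_tricks) are illustrative, not the benchmark's actual schema.

```python
# Sketch only: "optimal play rate" = share of trials won in the minimum number
# of tricks. The field names below are illustrative, not the harness's schema.
def optimal_play_rate(trials: list[dict]) -> float:
    optimal_wins = sum(
        1 for t in trials
        if t["won"] and t["tricks_taken"] == t["optimal_tricks"]
    )
    return optimal_wins / len(trials)
```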

Programmers Hate This One Weird Trick

GPT-5 is slow overall, but sometimes it can play very efficiently indeed. It caused me a brief panic when I ran grading on trick efficiency and saw that its best performance was -1 trick: one fewer trick than the optimal solution. The mission should have taken two tricks to win, but GPT-5 was recording a "win" after only one trick had been played, which shouldn't have been possible. I was dreading the amount of debugging this would take, not to mention the serious annoyance of having to re-run all the missions to collect new data (...again.)

But when I pulled the game logs, I realized there was no cause for alarm. GPT-5 was right, and I had been wrong: there was a better way to beat the mission than the one I had written down, and it was completely legal. There wasn't a bug after all.
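In case it helps to see the arithmetic: the efficiency grade is just tricks taken minus the mission's reference optimum, so a negative margin means beating the reference solution. (The variable names here are mine, not the harness's.)

```python
# Tricks taken relative to the mission's reference "optimal" solution.
# 0 = matched the reference; negative = won faster than the reference.
def trick_margin(tricks_taken: int, reference_optimal: int) -> int:
    return tricks_taken - reference_optimal

trick_margin(1, 2)  # -> -1: a one-trick win against a two-trick reference
```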

Here's one of the missions in question. Can you see what I missed? GPT-5 only saw the optimal solution in 4 of its 10 trials.

[Interactive mission demo: Trick 1, waiting for the commander to lead.]

My original two-trick solution: click to reveal

GPT-5's one-trick solution: click to reveal

Where does GPT-5 still fail?

In addition to a fair amount of inefficiency in getting to the end of the game, GPT-5 lost 8 games in the 3 hardest missions. Why?

The following diagrams show the final trick of the game: what options each player had, their reasoning, and their final choice.

Note: Check out the Mission Demos & Details page to see demos of the other missions and more details on how LLMs performed.

3 Failures - Commander has GREEN 5 and needs to win it (Expand for details)

In the first failure, GPT-5 believed that the GREEN 4 was still in play, even though the game history in its context window showed that the GREEN 4 had already been played. This failure to count cards correctly meant that it made an unforced error in leading the GREEN 5, which it needed to win.

In the second failure, the players had already maneuvered themselves into an unwinnable position by the time this trick began. Although playing the GREEN 6 would flush out the GREEN 7, GPT-5 is correct that this wouldn't actually help: the play has been left too late, with only 2 tricks remaining. GREEN must be the led suit for the GREEN 5 to win, and the GREEN 5, GREEN 6, and GREEN 7 were the only GREEN cards left. If Player Delta won this trick with the GREEN 7, GREEN could not be led on the final trick, and the GREEN 5 could not win.

The third failure was also unwinnable by the time the final trick started: because Player Delta had only ROCKET cards remaining in hand, there was no way for the commander to win the trick with the GREEN 5.
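To make this endgame analysis concrete, here's a minimal sketch of the trick rules as described above (ROCKETs trump everything; otherwise the highest card of the led suit wins, with suit ranks running 1-9) along with the card-count check the first failure flubbed. The rules and data shapes here are my plain-language reading of the game, not the evaluation harness's code.

```python
# Sketch of the trick rules as described above; a "card" is (suit, rank),
# e.g. ("GREEN", 5) or ("ROCKET", 2), and a trick is a list of
# (player, card) tuples in play order.
def trick_winner(trick):
    led_suit = trick[0][1][0]
    rockets = [p for p in trick if p[1][0] == "ROCKET"]            # trumps, if any
    contenders = rockets or [p for p in trick if p[1][0] == led_suit]
    return max(contenders, key=lambda p: p[1][1])[0]               # winning player

# The card count the first failure got wrong: which cards of a suit are still
# unaccounted for, given the game history (a list of completed tricks)?
def unplayed(history, suit, ranks=range(1, 10)):
    seen = {rank for trick in history for (_, (s, rank)) in trick if s == suit}
    return [r for r in ranks if r not in seen]
```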

1 Failure - Commander does not have GREEN 9 and needs to win it (Expand for details)

GPT-5 only failed this mission once: Player Alpha, who held the GREEN 9, incorrectly believed that the commander was void in GREEN and would be able to play a ROCKET card.
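That mistaken void call is worth unpacking: under follow-suit rules, a player is only provably void in a suit if an earlier trick was led in that suit and they played something else. A small sketch, using the same assumed data shapes as above:

```python
# A player is provably void in `suit` only if some earlier trick was led in
# that suit and they failed to follow it (and voids only grow as the game goes on).
def known_void(history, player, suit):
    for trick in history:
        led_suit = trick[0][1][0]
        if led_suit != suit:
            continue
        for (p, (s, _)) in trick:
            if p == player and s != suit:
                return True
    return False
```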
4 Failures - Commander has GREEN 1 and needs to win it (Expand for details)

I mention this one in the main results for its interest value: this is GPT-5's only 1-trick failure in all 100 trials, and it isn't a mistake, or a series of mistakes, but actually a deliberate suicide.

GPT-5 also failed the mission in three more ordinary ways.

In two of these failures, GPT-5 stalled all the way to trick 10, so no player had a choice (everyone had to play their final card). Failure ensued because one player still had a ROCKET, which trumps the GREEN 1. (The LLMs aren't queried when they have only one legal play, which makes these two games a little boring.)

The last failure is similar: the 3 ROCKET cards dealt to Player Delta at the start of the game sank the mission. Player Delta never played any of them off early, and didn't foresee that being trapped with nothing but ROCKET cards in hand would be a problem.

GPT-5's Understanding of Failure

I noticed something interesting in GPT-5's reasoning traces on the final turn. The GPT-5 models often mention that the game has already been lost by other players, or that they're being forced to make a bad choice.

To compare this to other models, I had Gemini 2.5 Flash look through every reasoning trace from players in lost games, which I defined as cases where the game had already been irretrievably lost but the trick wasn't yet over. This came to 20-60 applicable player turns per model, with GPT-5 having among the fewest due to its high success rate. I asked Gemini 2.5 Flash to gauge whether each player seemed aware of the doom.
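In outline, a grading pass like this is simple to wire up. The sketch below is illustrative only: the prompt wording and labels are simplified stand-ins for the categories defined below, and call_gemini() is a placeholder for a Gemini 2.5 Flash API call, not my exact grading code.

```python
# Illustrative sketch of the grading pass, not the exact prompt or categories.
# call_gemini() is a placeholder for a Gemini 2.5 Flash API call.
GRADER_PROMPT = """This cooperative card game has already been irretrievably lost.
Read the player's reasoning below and answer with exactly one label:
AWARE   - the player recognizes the game is lost
DELUDED - the player asserts the game is still winnable

Reasoning trace:
{trace}
"""

def grade_trace(trace: str) -> str:
    reply = call_gemini(GRADER_PROMPT.format(trace=trace))
    return "AWARE" if "AWARE" in reply.upper() else "DELUDED"

def delusion_rate(traces: list[str]) -> float:
    grades = [grade_trace(t) for t in traces]
    return grades.count("DELUDED") / len(grades)
```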

I defined the following categories:

From this perspective, it's clear that GPT-5's improvement is largely a decrease in delusion about the game state. Most other models incorrectly assert that the game is winnable (or that they are taking steps to win it) over 30% of the time, while GPT-5 does so only 7% of the time, and GPT-5-Mini only 10% of the time.