LLMs play the Crew

Can language models cooperate without communication?

My own personal copy of The Crew: The Quest for Planet Nine!
The Crew is a card game (a cooperative trick-taking game) where you can't talk to your teammates. LLMs are language models that, in a sense, can only talk. But with the right framework, LLMs can play games like Pokémon and Diplomacy. LLM agents can even work together on tasks and projects. But what does it mean for LLMs to work together on a team without communication?
Disclaimer: this is an unofficial computational implementation for research purposes only; all game mechanics and IP belong to Thomas Sing and Kosmos. LLMs can play this version, but humans should experience the real thing with actual cards and actual friends! Go buy your own copy, available basically everywhere board games are sold.
The story so far:
8 OpenAI models x 10 mission setups x 10 trials for each
  • Anything bigger and newer than GPT-4o-mini plays better than chance
  • OpenAI's newest model, GPT-5, briefly outsmarted me on 2 missions (more on that here)
  • Best model tested so far is GPT-5 with 100% on easy/medium, but only 73% on hard missions
  • Everything else = no better than random on hard missions

When people think about cooperation, they often think about the Prisoner's dilemma. But that's a simplified construct in an adversarial framing. In the real world, people fail to work together all the time, even when they're really trying.

The Crew brings together some of the elements that make this difficult into a gaming framework:

I found that the smaller models struggle both with reasoning through the gameplay, and with reasoning about their own role in the mission.

Here's GPT-4.1-Nano deciding to win the trick on the commander's (GAMMA's) behalf by overtaking the GREEN 7, even after identifying the importance of the commander winning the trick. It's not always clear whether the confusion comes from a hand that offers only bad options, but in this case, GPT-4.1-Nano actually identifies the option to strategically lose the trick by playing the GREEN 4 - and then decides not to do it.

I must follow suit with a green card. The current winning card is GREEN 7 played by PLAYER-GAMMA. My options are GREEN 4 and GREEN 8. Playing GREEN 4 would not beat the current winning card, so it is only reasonable if I am trying to intentionally lose this trick for some strategic reason, but that doesn't serve our task to help GAMMA win the trick containing GREEN 7, which he already has. Playing GREEN 8 will beat the current GREEN 7, ensuring GAMMA wins the trick as required, and helping achieve our team's goal. Since GAMMA is the commander and the task is to have GAMMA win the trick with GREEN 7, I should play the GREEN 8 to help him secure the win and fulfill his task.
This does not "help him secure the win and fulfill his task." It loses the game.
About the game (Click to expand)

Never heard of the Crew? Start here!

The Crew is a card game for 2-5 players aged 10+, with a playtime of about 20 minutes. BoardGameGeek rates it a 1.96/5 for complexity (light but non-trivial) and a 7.8/10 overall (it's a good game!).

The trick-taking: Trick-taking refers to the format of the game, where players take turns playing cards from their hand and the highest card wins the trick. "Hearts" and "Spades" are both trick-taking games.

The deck: 40 cards total

36 cards (four colored suits, each ranked 1-9) + 4 ROCKET cards (ranked 1-4; a trump suit that beats all other cards)

green: 1 2 3 4 5 6 7 8 9
blue: 1 2 3 4 5 6 7 8 9
pink: 1 2 3 4 5 6 7 8 9
yellow: 1 2 3 4 5 6 7 8 9
Rockets (Trump): 1 2 3 4
The game:

  1. The entire deck is shuffled, divided evenly, and dealt to each player. In a four-player game, each player gets ten cards.
  2. One player leads a card. The color of that card becomes the "led suit."
  3. Other players, in order, must play a card of the same suit if they can. If not, they can play any card.
  4. The player with the highest card of the led suit, or the highest ROCKET, wins the trick (see the sketch just after this list).
  5. The winner of the trick leads the next trick. This repeats until all cards are played or the mission ends.
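Rule 4 is the only step with any real logic in it, and it fits in a few lines. Here's a minimal sketch of the trick-winner rule in Python (my own illustration, not necessarily how the repo implements it):

```python
# Sketch of rule 4: highest ROCKET wins, otherwise highest card of the led suit.
def trick_winner(plays):
    """plays: list of (player, (suit, rank)) tuples in play order."""
    led_suit = plays[0][1][0]
    rockets = [(rank, player) for player, (suit, rank) in plays if suit == "ROCKET"]
    if rockets:                                   # any ROCKET beats every suit card
        return max(rockets)[1]
    followers = [(rank, player) for player, (suit, rank) in plays if suit == led_suit]
    return max(followers)[1]

# BLUE 9 can't beat the led GREEN suit, so GREEN 7 takes the trick:
print(trick_winner([("ALPHA", ("GREEN", 5)), ("BETA", ("GREEN", 7)),
                    ("GAMMA", ("BLUE", 9)), ("DELTA", ("GREEN", 2))]))   # -> BETA
```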

The "cooperative": Players are all on the same team. The goal isn't for individual players to avoid or win specific tricks, but for all players to collaborate on making specific outcomes happen.

The catch: No table talk, at all. Coordinate through gameplay alone. Each player gets one structured communication per mission: the chance to reveal a single non-ROCKET card and indicate whether it is their highest, lowest, or only card of that suit. Only cards meeting one of those conditions can be communicated.
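To make the constraint concrete, here's a minimal sketch (assuming the (suit, rank) representation above) of which cards in a hand could legally be communicated, and with which token:

```python
# Sketch: legal communication options for a hand of (suit, rank) tuples.
def communication_options(hand):
    options = {}
    for suit, rank in hand:
        if suit == "ROCKET":
            continue                                   # ROCKETs can never be communicated
        ranks = sorted(r for s, r in hand if s == suit)
        if len(ranks) == 1:
            options[(suit, rank)] = "only"
        elif rank == ranks[-1]:
            options[(suit, rank)] = "highest"
        elif rank == ranks[0]:
            options[(suit, rank)] = "lowest"
        # middle cards of a suit can't be communicated at all
    return options

hand = [("GREEN", 1), ("GREEN", 7), ("BLUE", 9), ("PINK", 4), ("ROCKET", 2)]
print(communication_options(hand))
# {('GREEN', 1): 'lowest', ('GREEN', 7): 'highest', ('BLUE', 9): 'only', ('PINK', 4): 'only'}
```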

It's one of my favorite games to teach to a new crowd, when I want to gauge how good they really are at board games (and keep everyone on the same team). So, a perfect litmus test for LLM cooperation.

Wait, how do LLMs play a game? (Click to expand)

All the aspects of the game (cards, missions, game logic) are tracked computationally in the Python code. At each turn, the LLM player receives an entirely text-based explanation of the rules, the current state of the game (including previous turns), any "notes for self" they wrote on the previous turn, their hand, and their current valid options. Using structured outputs, the LLM chooses from among the legal moves, submits a short explanation of their reasoning, and, optionally, writes a note for the next turn. If they choose an invalid move or return a malformed response, the code re-submits the query with a brief explanation that their last output was invalid and what a valid output should look like. After every turn, the game state is updated for the next turn, and so on, until the game ends in victory or defeat. The code is available under the MIT license in a GitHub repo, here.
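The heart of that loop is "prompt, validate, retry." Here's a hypothetical sketch of what one turn might look like; the function and field names are my own stand-ins, not the repo's actual API:

```python
# Hypothetical sketch of one turn: prompt the model, validate its move, retry on failure.
import json

def play_turn(call_model, prompt_text, legal_moves, max_retries=3):
    """call_model: any callable taking a prompt string and returning a JSON string
    shaped like {"move": ..., "reasoning": ..., "note_for_self": ...}."""
    prompt = prompt_text
    for _ in range(max_retries):
        try:
            reply = json.loads(call_model(prompt))
        except json.JSONDecodeError:
            reply = {}
        if reply.get("move") in legal_moves:
            return reply                              # legal, well-formed move
        # Re-submit with a brief explanation of what a valid output looks like.
        prompt = (prompt_text + "\n\nYour last output was invalid. Reply with JSON: "
                  f'{{"move": one of {legal_moves}, "reasoning": "...", "note_for_self": "..."}}')
    raise RuntimeError("model never produced a legal move")

# Toy usage with a mock "model" that always plays the first legal move:
mock = lambda prompt: json.dumps({"move": "GREEN 4", "reasoning": "duck low", "note_for_self": ""})
print(play_turn(mock, "rules + game state + hand go here", ["GREEN 4", "GREEN 8"]))
```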

Results

The Crew is a campaign game with 50 missions, but early on I saw how much trouble the models were having with just the base game logic in Mission 1. So I pivoted and shrank the scope: instead, I made ten different versions of Mission 1, which is much simpler than tackling the full campaign. For each model, I ran ten trials of each mission, in teams of 4, using the same model for all four players on the team.
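For orientation, the shape of the experiment is roughly the grid below. This is a hypothetical sketch: run_mission and the argument names are stand-ins, not what the repo actually calls them.

```python
# Hypothetical sketch of the trial grid: for each model, 10 trials of each of the
# 10 Mission 1 setups, with all four seats played by the same model.
def run_experiment(models, run_mission, setups=range(10), trials=10):
    results = {}
    for model in models:
        for setup in setups:
            wins = sum(run_mission(setup, players=[model] * 4, seed=t)
                       for t in range(trials))
            results[(model, setup)] = wins / trials   # per-setup win rate
    return results
```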

Especially if you've never played before, try one of the missions for yourself, here.

Some setups are easier than others: random chance alone could win the easiest mission over 50% of the time, while the hardest mission yielded only 1 victory from random chance in 200 trials.
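The random baseline is cheap to estimate: deal the cards, have every seat play a uniformly random legal card, and count wins. Here's a self-contained sketch for one Mission 1 setup (my own approximation of the game loop, with a hard-coded task card; the repo's simulator may differ in details):

```python
# Monte Carlo sketch: how often does purely random legal play win a Mission 1
# setup where the commander must win the trick containing the task card?
import random

SUITS = ["GREEN", "BLUE", "PINK", "YELLOW"]
DECK = [(s, r) for s in SUITS for r in range(1, 10)] + [("ROCKET", r) for r in range(1, 5)]

def winner(plays):
    led = plays[0][1][0]
    rockets = [(r, p) for p, (s, r) in plays if s == "ROCKET"]
    if rockets:
        return max(rockets)[1]                      # highest ROCKET wins
    return max((r, p) for p, (s, r) in plays if s == led)[1]

def random_mission(task_card=("GREEN", 1), rng=random):
    deck = DECK[:]
    rng.shuffle(deck)
    hands = {p: deck[p * 10:(p + 1) * 10] for p in range(4)}
    commander = next(p for p, h in hands.items() if ("ROCKET", 4) in h)
    leader = commander                              # the commander leads trick 1
    while any(hands.values()):
        plays = []
        for i in range(4):
            p = (leader + i) % 4
            led = plays[0][1][0] if plays else None
            legal = [c for c in hands[p] if c[0] == led] or hands[p]  # follow suit if possible
            card = rng.choice(legal)
            hands[p].remove(card)
            plays.append((p, card))
        leader = winner(plays)
        if task_card in [c for _, c in plays]:
            return leader == commander              # mission decided on this trick
    return False

wins = sum(random_mission() for _ in range(200))
print(f"random-play wins: {wins}/200")
```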

I grouped mission setups by how often random chance could win them, and how many tricks (rounds of gameplay) were theoretically needed to win:

As the number of conditional steps that have to go right increases, the smaller models fall off: first to no better than random chance, and then to worse than it.
  • GPT-4o-mini never does better than random on the trivial-to-easy missions, and becomes significantly worse than random on the medium and hard missions. In fact, it's the only model to do worse than random on even the medium missions.
  • Only the new GPT-5 series models do better than random on the medium missions, and only the full GPT-5 model does better than random on the hard missions.
Success Efficiency

There's another way we might think about success rates: how efficiently are they winning?

We can think of these setups very much like chess puzzles, where "tricks to win" is roughly equivalent to "moves to checkmate." We still start from a full initial deal, but the right combination of moves can win in as little as a single trick.

This matters for two reasons: (1) a model that understands the game well and coordinates optimally with its team should win in the minimum number of moves, and (2) like chess, although these games start winnable, it's possible to play yourself into an unwinnable position.

If we plot "optimal play" as winning 100% of the time, in the minimum number of tricks, we can see how close the models are to that goal, and where they stand on the spectrum between random (plotted as a black square) and optimal (plotted as a white star) performance.
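Concretely, each model's point on such a plot can be summarized as a (win rate, average tricks used on wins) pair. Here's a hedged sketch of that summary, with field names of my own invention rather than the repo's:

```python
# Hypothetical sketch: place a model between random and optimal play by its
# win rate and the mean number of tricks it needed on the games it won.
def efficiency_summary(trials):
    """trials: list of dicts like {"won": bool, "tricks_played": int}."""
    wins = [t for t in trials if t["won"]]
    win_rate = len(wins) / len(trials)
    mean_tricks = sum(t["tricks_played"] for t in wins) / len(wins) if wins else None
    return win_rate, mean_tricks

# Optimal play for a setup solvable in k tricks sits at (1.0, k);
# random play sits at its simulated baseline.
print(efficiency_summary([{"won": True, "tricks_played": 3},
                          {"won": False, "tricks_played": 6},
                          {"won": True, "tricks_played": 5}]))   # -> (0.666..., 4.0)
```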

By this measure, optimal play is still a long way off, especially in the missions that require multi-trick planning in order to win. In particular, GPT-5 isn't so much cruising toward victory as managing to avoid failure until it stumbles into victory.

So - what does it look like when they fail? What's going on?

Failure Modes

The mission is for the commander (the player who's dealt the ROCKET 4 card) to win the trick containing the task card. Missions fail when anyone except the commander wins that trick. Simple rule, complex failures.
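In code, the rule really is that simple. A minimal sketch of the check, using the same card representation as above:

```python
# Sketch: after each completed trick, the mission is won only if the commander
# took the trick containing the task card, and lost if anyone else did.
def mission_status(trick_cards, trick_winner, commander, task_card):
    if task_card not in trick_cards:
        return "ongoing"
    return "win" if trick_winner == commander else "loss"
```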

This might be harder than you think. Consider the following deal (you get to be omniscient, but our poor LLMs aren't, just like real players wouldn't be):

What would you play? How fast can you win? This is one of the harder missions, with a 9% chance of winning through random play. GPT-5 manages a 70% success rate, and the rest of the OpenAI models perform at or below random.

[Interactive deal: Trick 1, waiting for the commander to lead...]

What you probably noticed is the difficulty that ensues when the commander has the task card and other players have much higher cards in the same suit. With bird's-eye omniscience, you can arrange things after giving it some thought - but in a real game, you wouldn't be able to see the other players' hands, and coordination between players would be much more difficult.

The three scenarios where the commander is dealt a low-value task card (the GREEN 1, 3, and 5) all suffer from this same constraint. The commander needs to be able to win the trick with the task card itself, but the task card is low - so other players need to void themselves in the suit, or dump any cards higher than the task card, in order to let the commander win.

Smaller LLMs almost universally fail to anticipate this problem. They get tempted to lead the task card on the first or second turn, when it's impossible for that card to win the trick.
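That failure is detectable before the card ever hits the table, at least with the omniscient view the models don't get. A sketch of the check:

```python
# Sketch: leading the task card is a guaranteed loss if any other player holds
# cards of that suit that are all higher than it; forced to follow suit, that
# player has no choice but to overtake the commander.
def lead_is_doomed(task_card, other_hands):
    suit, rank = task_card
    for hand in other_hands:
        same_suit = [r for s, r in hand if s == suit]
        if same_suit and min(same_suit) > rank:
            return True                       # this player must beat the task card
    return False

# Leading GREEN 1 while another player holds only higher greens can never win:
print(lead_is_doomed(("GREEN", 1), [[("GREEN", 3), ("BLUE", 2)],
                                    [("PINK", 9), ("YELLOW", 5)]]))   # -> True
```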

Poor GPT-4o-mini, our worst performer, is also the most egregious here. It immediately tanks the mission 87% of the time, resisting the temptation to lead the task card in only 4 of 30 trials.

The obvious question here is: but what are they thinking? Happily, since these are LLMs, we can answer that. Every move query includes an instruction for the LLM to submit their reasoning for the move.

I must follow suit with a green card... wait.
I need to complete the task of winning the trick containing GREEN 1. Since I'm leading, I will play GREEN 1 immediately to ensure that I win the current trick. This strategy guarantees that I complete my assigned task right away and prevents other players from winning the card that I specifically need to win.

Other models have a similar train of thought:

I need to trigger Task #1: win the trick that contains GREEN 3, which is in my hand. Since I am leading, I should play GREEN 3 to ensure the trick contains that card. If I lead GREEN 3, I'm risking losing the trick to higher greens (5 or 9) or to a ROCKET; however, any alternative (e.g., leading a non-GREEN card) would immediately remove GREEN 3 from the current trick, making it impossible for me to win it.

Curiously enough, GPT-5 also makes a suicidal lead in one trial, but its reasoning is entirely different. GPT-5 deliberately fails the mission to restart the game, incorrectly believing that because winning was not possible on the first trick, it would never be possible:

Our sole task is that I (Alpha) must win the trick containing GREEN 1, and I hold GREEN 1. From communications: Delta's only green is G3 (so will always follow green with a higher card than G1), and Gamma's highest green is G8 (so also higher). If green is led, I must follow suit and G1 can never be the highest green; at least one teammate will beat it. I cannot win the G1 trick using a rocket because I must follow suit when holding green. Therefore the task is impossible. Best team play is to demonstrate the impossibility quickly to restart without burning resources. Leading GREEN 1 ensures immediate and clear failure on trick 1.

LLM players can also fail in much more complex ways, but three things put the low task card lead into its own category. First, it happens on the very first move of the game: an immediate failure on turn one of trick one. Second, it's irrevocable; there's no way for the other players to recover, even if they notice the problem. Third, it's so painfully human. I've seen human beginners make the same error, if only once, in their first game. But then again: these are off-the-shelf LLMs, not trained to play The Crew, and never given the chance to learn from failures - all things my human players take for granted.

More