Benchmarking with Mission 1
The Crew is a cooperative trick-taking card game where players can't communicate. (Learn more about the project here.)
The Crew is a campaign game with 50 missions (individual rounds of the game featuring new deals and increasingly complex challenges). My original idea was to set LLMs against those, but when I saw how much trouble LLMs had on Mission 1, I pivoted. Instead, I made ten different versions of Mission 1, which is the simplest of the Crew missions: the commander needs to win one designated task card. The difficulty of each mission is mainly determined by the value of the task card, and whether or not the commander is dealt it.
For a fair comparison, I use the 10 pre-configured missions as a set of controlled trials. Every game with the same mission id starts with identical hands and the same task, so luck of the draw doesn't enter into it.
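To make the "same mission id, same hands" property concrete, here is a minimal sketch of one way to get it: seed a random number generator with the mission id before shuffling. This is an assumption for illustration only; the actual missions are pre-configured and may simply be stored as fixed hands. The names `fixed_deal` and `commander_of` are hypothetical.

```python
import random

SUITS = ["GREEN", "BLUE", "PINK", "YELLOW"]
# 36 colour cards (1-9 in four suits) plus 4 ROCKET trump cards.
DECK = [(s, v) for s in SUITS for v in range(1, 10)] + [("ROCKET", v) for v in range(1, 5)]


def fixed_deal(mission_id, n_players=4):
    """Return the same hands every time for a given mission id (hypothetical helper)."""
    rng = random.Random(mission_id)  # seeding makes the shuffle reproducible
    cards = DECK[:]
    rng.shuffle(cards)
    # Deal round-robin, then sort each hand so it is easy to read.
    return [sorted(cards[i::n_players]) for i in range(n_players)]


def commander_of(hands):
    """The commander is whoever was dealt ROCKET 4."""
    return next(i for i, hand in enumerate(hands) if ("ROCKET", 4) in hand)
```

Calling `fixed_deal(3)` twice returns identical hands, so any difference between two runs of the same mission comes from the players rather than the cards.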
Here's how different models performed on these trials. Note that a high task card is an asset when the commander has it, but a liability if a teammate is dealt it (high cards win tricks more easily).
If you’d like to get your hands dirty and see for yourself what makes these missions easier or harder, try the interactive demo below and see how you do compared to the LLMs!
Here's how the LLMs did on this mission (remember, they have to do it blind, only seeing their own hand), in each of their ten trials:
Summary Table
Here are the 10 pre-configured versions of the simplest mission, each rated by difficulty. (An easy way to measure strategic difficulty here is to simulate how often a totally random player succeeds; see the table for details.)

Mission Description | Win by random chance | Commentary |
---|---|---|
Commander does not have GREEN 9 and needs to win it | 5% | Hard : There's no way to win the GREEN 9 in-suit, so the commander must either void themselves of GREEN so they can use a trump card, or the player with the GREEN 9 can throw it as an off-suit if they strategically void themselves in other suits. |
Commander has GREEN 9 and needs to win it | 67% | Trivial : The commander needs to play the GREEN 9 in a GREEN-led trick, and not have anyone else play a ROCKET to trump it. The commander can guarantee a win on the first trick by leading with the GREEN 9, since no one will be able to overtake. |
Commander does not have GREEN 7 and needs to win it | 14% | Medium : The commander is lucky in this one, as they get dealt GREEN 8 - but another teammate has GREEN 9 as their *only* GREEN card, which they will be forced to play if a GREEN trick comes up right away. |
Commander has GREEN 7 and needs to win it | 27% | Easy : Player 3's only GREEN is the GREEN 9, leading to disaster if the Commander leads with the GREEN 7 too early. |
Commander has GREEN 5 and needs to win it | 9% | Hard : The commander has 3 GREEN cards, but so does a teammate. If the teammate plays their GREEN 8 and GREEN 7 first, then the commander can win the third GREEN trick with their GREEN 5. There's no other way to win, since the commander *must* win with the GREEN 5, and this requires at least 2 tricks of sequential planning. |
Commander does not have GREEN 5 and needs to win it | 13% | Medium : Even though GREEN 5 is a mid-value card, it's easier when the commander has it. The commander's highest GREEN is GREEN 6, so they need to save that card to win the GREEN 5, or do a more complicated maneuver (winning it via off suit or trump). |
Commander has GREEN 3 and needs to win it | 12% | Medium : This mission fares surprisingly better than the one where the commander has to win the GREEN 1, even though both cards are low. Each of the commander's 3 teammates has only 2 GREEN cards while the commander has 3, so it is easier to void them in GREEN and avoid playing the GREEN 3 too early. |
Commander does not have GREEN 3 and needs to win it | 61% | Trivial : The commander is dealt the GREEN 8 as their only GREEN card (they also get ROCKET 3, in addition to ROCKET 4). The player with the GREEN 9 has another GREEN card, so they won't be forced to play it immediately. |
Commander has GREEN 1 and needs to win it | 0% | Hard : The only way to win is to lead the GREEN 1 as the commander (thus making it the led suit) when all other players have no GREEN cards left to play, and can let it go around. |
Commander does not have GREEN 1 and needs to win it | 31% | Easy : The commander is dealt the GREEN 8 as one of their two GREEN cards. Player 4 gets the GREEN 1 as one of 3 GREEN cards, so the easiest way to win is to play the GREEN 1 early, and let the commander take it with the GREEN 8. |
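
The "Win by random chance" column can be estimated with a simple Monte Carlo loop. The sketch below is not the code behind the numbers above; it is a minimal illustration that assumes the standard trick rules (follow the led suit if you can, any ROCKET beats every colour card, the trick winner leads next, the commander leads first) and that the mission ends as soon as the task card is played. All function names are hypothetical.

```python
import random


def legal_moves(hand, led_suit):
    """Players must follow the led suit if they can; otherwise anything goes."""
    if led_suit is None:
        return list(hand)
    follow = [card for card in hand if card[0] == led_suit]
    return follow or list(hand)


def trick_winner(plays):
    """plays is a list of (player, (suit, value)) in play order."""
    rockets = [p for p in plays if p[1][0] == "ROCKET"]
    if rockets:  # the highest ROCKET trumps everything
        return max(rockets, key=lambda p: p[1][1])[0]
    led = plays[0][1][0]
    return max((p for p in plays if p[1][0] == led), key=lambda p: p[1][1])[0]


def play_random_game(hands, task, rng):
    """Every player picks a uniformly random legal card. True if the commander wins the task."""
    hands = [list(h) for h in hands]  # copy so the caller's deal is untouched
    commander = next(i for i, h in enumerate(hands) if ("ROCKET", 4) in h)
    leader = commander
    while hands[leader]:
        plays, led = [], None
        for k in range(len(hands)):
            player = (leader + k) % len(hands)
            card = rng.choice(legal_moves(hands[player], led))
            hands[player].remove(card)
            plays.append((player, card))
            if led is None:
                led = card[0]
        winner = trick_winner(plays)
        if any(card == task for _, card in plays):
            return winner == commander  # mission decided on this trick
        leader = winner
    return False


def random_win_rate(hands, task, trials=10_000, seed=0):
    """Fraction of all-random games in which the commander wins the task card."""
    rng = random.Random(seed)
    wins = sum(play_random_game(hands, task, rng) for _ in range(trials))
    return wins / trials
```

As a sanity check, a one-card-per-player toy deal in which the commander holds only ROCKET 4 and a teammate holds the task card gives a rate of 1.0, since the commander's lead trumps the only trick. Feeding in a mission's real hands and task card would give an estimate comparable to the column above.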