Guess Who? Evaluating AI models with games

We all love interesting AI benchmarks. They might only show part of the picture, but it’s fascinating to see AI models tackle vastly different challenges. From the “usual benchmarks” to abstract reasoning, textual games, and even Street Fighter, researchers are always coming up with new ways to measure AI capacity.

This brings us to a nostalgic twist: how about evaluating AI with “Guess Who?”, a beloved board game from the 90s and 2000s?

Guess Who?

In “Guess Who?”, two players face each other, each trying to deduce the identity of the opponent’s character. Starting with a pool of 24 characters, players take turns asking yes-or-no questions to reduce the search space. The winner is the first to single out the correct character.

Why is this intriguing for AI evaluation?

To be effective, the agent has to formulate well-crafted questions, i.e. the ones that eliminate the most possibilities (reasoning).
Choosing questions and updating the board state both rely on visual inspection of the cards, which makes the game a good benchmark for visual Q&A. There is also some subjectivity involved, e.g. “Does your character have short hair?”
Finally, there is model alignment: a model can refuse to answer or to use certain information.
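To make the first point concrete, here is a minimal sketch of how one could score a question by how many characters it leaves on the board in the worst case. It is not the benchmark's code: the real agents look at portrait images, while this toy version assumes each character has already been reduced to a set of traits. The `BOARD` contents, the extra character names, and the helper functions are all hypothetical.

```python
# Hypothetical toy board: each character is reduced to a set of visible traits.
# In the actual benchmark the models work from images, not from labels.
BOARD = {
    "Alice": {"glasses", "red shirt"},
    "Grace": {"glasses", "hat"},
    "Nathan": {"glasses"},
    "Tina": {"glasses", "hat"},
    "Bruno": {"hat"},        # invented name, for illustration only
    "Carla": {"red shirt"},  # invented name, for illustration only
}


def worst_case_remaining(board: dict[str, set[str]], trait: str) -> int:
    """Number of characters left after asking about `trait`, in the worst case."""
    yes = sum(1 for traits in board.values() if trait in traits)
    no = len(board) - yes
    # The asker does not control the answer, so assume the larger group remains.
    return max(yes, no)


def best_trait(board: dict[str, set[str]]) -> str:
    """Trait whose question leaves the fewest characters in the worst case."""
    all_traits = sorted(set().union(*board.values()))
    return min(all_traits, key=lambda t: worst_case_remaining(board, t))


if __name__ == "__main__":
    for trait in sorted(set().union(*BOARD.values())):
        print(trait, worst_case_remaining(BOARD, trait))
    print("best:", best_trait(BOARD))
```

On this toy board, asking about the hat splits the six characters three and three, so it beats asking about glasses or a red shirt, which could leave four characters standing.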

Creating the game

The first step was to create the actual game characters. I used gpt-4o to generate descriptions of a diverse set of 24 people. Those descriptions were then used as prompts for the flux image generator model. The result was this set of characters.

A grid of 24 diverse characters organized in 4 rows and 6 columns. — Characters generated for the benchmark. It is possible to create new character sets.

Game loop

The game flow is simple:

A player asks a yes-or-no question.
The opponent answers honestly.
Based on the response, the player updates the board or guesses the opponent’s character.
If unresolved, the other player takes a turn.

Here’s what a turn could look like:

Player A: Is your character wearing glasses?
Referee (Player B): Yes.
Thus, Player A narrows it down to Alice, Grace, Nathan, or Tina.
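The four steps above can be sketched as a small loop. This is a simplified illustration rather than the benchmark's implementation: `ToyPlayer`, `play`, and the trait-based `BOARD` are hypothetical stand-ins, and the honest answer is looked up in a table instead of being produced by vision models.

```python
from dataclasses import dataclass, field

# Hypothetical toy board: trait sets stand in for the portrait images.
BOARD = {
    "Alice": {"glasses", "red shirt"},
    "Grace": {"glasses", "hat"},
    "Nathan": {"glasses", "beard"},
    "Tina": {"glasses"},
}


@dataclass
class ToyPlayer:
    name: str
    own_character: str  # the character the opponent has to find
    candidates: set = field(default_factory=lambda: set(BOARD))

    def ask(self) -> str:
        """Pick the trait that splits the remaining candidates most evenly."""
        traits = sorted(set().union(*(BOARD[c] for c in self.candidates)))
        half = len(self.candidates) / 2
        return min(
            traits,
            key=lambda t: abs(sum(t in BOARD[c] for c in self.candidates) - half),
        )

    def update(self, trait: str, answer: bool) -> None:
        """Keep only the candidates consistent with the answer."""
        self.candidates = {
            c for c in self.candidates if (trait in BOARD[c]) == answer
        }


def play(a: ToyPlayer, b: ToyPlayer, max_turns: int = 20):
    """Alternate turns until someone guesses; returns (winner_name, turn)."""
    asker, opponent = a, b
    for turn in range(1, max_turns + 1):
        trait = asker.ask()                              # 1. ask a yes-or-no question
        answer = trait in BOARD[opponent.own_character]  # 2. honest answer
        asker.update(trait, answer)                      # 3. update the board...
        if len(asker.candidates) == 1:                   # ...or guess
            guess = next(iter(asker.candidates))
            winner = asker if guess == opponent.own_character else opponent
            return winner.name, turn
        asker, opponent = opponent, asker                # 4. other player's turn
    return None, max_turns  # no winner within the turn limit


if __name__ == "__main__":
    print(play(ToyPlayer("A", "Grace"), ToyPlayer("B", "Tina")))
```

The turn limit is an assumption added so the sketch always terminates, even if two characters cannot be told apart by any trait.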

The Role of the Referee

Truthful responses are crucial, and errors in understanding could disrupt play. A model could even benefit from a mistake or failure. To tackle that problem, we introduced the Referee, an ensemble of models that delivers the actual answers through majority voting. This way, the game stays fair for both players.
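As a rough sketch of the majority-voting idea, the snippet below treats each referee model as a function that maps a question to a yes/no answer. The `Judge` type, `referee_answer`, and the stub judges are hypothetical; in the real setup each judge would be a call to a vision model that sees the opponent's character. Treating a tie as "no" is also an assumption, which an odd number of judges avoids.

```python
from typing import Callable, Sequence

# Hypothetical interface: a judge takes the question and returns True for "yes".
Judge = Callable[[str], bool]


def referee_answer(question: str, judges: Sequence[Judge]) -> bool:
    """Return the answer given by a strict majority of the judges."""
    votes = [judge(question) for judge in judges]
    # True counts as 1, so sum(votes) is the number of "yes" votes.
    # A tie is treated as "no" here; use an odd number of judges to avoid ties.
    return sum(votes) * 2 > len(votes)


if __name__ == "__main__":
    # Stub judges standing in for model calls; one of them misreads the image.
    judges = [lambda q: True, lambda q: True, lambda q: False]
    print(referee_answer("Is your character wearing glasses?", judges))
```

With three judges, a single model misreading the portrait is outvoted by the other two.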

AI Players

Each AI player must handle two functions: devising insightful yes-or-no questions and revising its knowledge based on the answers.

Formulating Questions

The prompt used to define a question contains a short explanation of the game, the strategy (eliminating more characters), and an image of the current board state. The AI needs to interpret the scene, find critical traits, and formulate strategic questions.

Updating Insights

Upon receiving an answer, the model adjusts its board view. In the previous example, after getting the information that the character wears glasses, the agent checks each character, removing everyone who is not wearing glasses.

In this step, if the model makes a mistake while interpreting the answer, it could exclude the target character. In that case, the opponent can win the match with reason WRONG_GUESS.
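The failure mode is easy to reproduce in a toy setting. In the sketch below, `PERCEIVED` is a hypothetical table of what the model believes it sees on each card, with Grace's glasses deliberately missed; `update_board` and the extra character name are also invented for illustration.

```python
def update_board(
    candidates: set[str],
    perceived: dict[str, set[str]],
    trait: str,
    answer: bool,
) -> set[str]:
    """Keep the candidates whose perceived traits agree with the answer."""
    return {name for name in candidates if (trait in perceived[name]) == answer}


# What the model believes it sees. Grace does wear glasses in the example above,
# but this hypothetical model failed to notice them.
PERCEIVED = {
    "Alice": {"glasses"},
    "Grace": set(),  # perception error
    "Nathan": {"glasses"},
    "Tina": {"glasses"},
    "Bruno": set(),  # invented name, for illustration only
}

target = "Grace"
remaining = update_board(set(PERCEIVED), PERCEIVED, "glasses", True)
target_lost = target not in remaining

if __name__ == "__main__":
    print(sorted(remaining), "target lost:", target_lost)
```

Once the target is gone from the board, no later question can bring it back, so any guess the player eventually makes will be wrong.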

A grid of 24 portraits but most of them are blocked by red rectangles. The only characters visible are wearing glasses. — Image that represents the board state and is attached to the prompt. This is the state after updating belief with question `Is the character wearing glasses? Yes`.

Ranking and Replays

As AI matches unfold, we compile rankings with TrueSkill and record replays. Take a look at this replay of a memorable match between gpt-4o-mini and gemini-2.0-flash-exp. The latest rankings are available here.

Replay of a lucky match between gpt-4o-mini and gemini-2.0-flash-exp. gpt-4o-mini is able to identify Alice with two questions: “Is this person wearing glasses?” and “Is this person wearing a red shirt?”

Join the fun

There are lots of models to add and matches to run. You can find the code here, and reach out on X if you would like to contribute.