Cognitive Psychology for AI Evaluation

June 25, 2024

This blog post was written by Kozzy Voudouris based on a presentation on AI capabilities given by the Kinds of Intelligence team at the biannual CFI retreat. Lucy Cheke, José Hernández-Orallo, John Burden, Lorenzo Pacchiardi, Jonathan Prunty, and Ben Slater provided feedback on the manuscript.

Contemporary Artificial Intelligence (AI) systems are capable of astonishing feats, from beating humans at computer games to correctly answering questions about maths, law, and general knowledge. But they are also surprisingly brittle, prone to failure when faced with problems slightly different to those they are familiar with. These failures range from the comic and the mundane to the downright dangerous. Schlarmann and Hein found that one system captions a picture of a banana on a plate as “Pug in an onion”. They also found that images of President Joe Biden could be altered such that the same model captions them as “Joe Biden orders nuclear strike”. That AI systems appear at once so capable and so incompetent is a fundamental paradox in the modern machine age.

If we are to make well-informed decisions about the deployment of intelligent systems, we had better be sure that they behave in ways that we expect and understand. As AI systems improve, the ways we study their capabilities must improve too. Sophisticated systems will exhibit sophisticated capabilities, but they can also fail in increasingly complex and unexpected ways.

Studying the capabilities of complex behavioural systems has been the job of cognitive psychologists for over a century, but the focus until now has been on humans and other animals. In the Kinds of Intelligence Programme at the CFI, we treat AI systems as another kingdom of species, a new class of complex agents to scientifically investigate. By studying the behaviour of AI from the perspective of cognitive psychology, we can robustly evaluate the capabilities of these systems, improving explainability and predictability. This contrasts with other approaches in the field of AI, such as Explainable AI and Mechanistic Interpretability, that do not draw on the rich methodology of the cognitive sciences.

Disentangling Capabilities and Behaviour in Natural and Artificial Intelligences

Sophisticated behaviours exist throughout the evolutionary tree. Archerfish can hunt rapidly moving prey by spurting water at a precise location at a precise time to knock them into the water. Bumblebees can learn to open a complex puzzle box by imitating the behaviour of other bees, a kind of social learning that has been deemed crucial to the emergence of culture in other animals. Crows, finches, dolphins, and chimpanzees are just some of the non-human animals that have been observed to use tools. Non-human animals are also capable of solving problems that humans struggle with: pigeons were found to beat humans on a version of the Monty Hall problem. Some chimpanzees also outperformed humans on a working memory task. Numbers were briefly flashed on the screen, and participants were tasked with tapping the locations of the numbers in ascending order. The chimpanzees were able to do this more accurately than humans and with less time to see the numbers. Sophisticated, sometimes superhuman, behaviours are found across the animal kingdom.

The goal of cognitive psychology over the past century has been to characterise what humans and other animals can do, what they cannot do, how they fail, and why. In other words, cognitive psychologists have been acutely engaged with revealing and comparing the capabilities of complex behavioural systems.

A fundamental hurdle in this endeavour was identified early in the history of psychology: there are usually many plausible explanations for apparently intelligent behaviours (Köhler, 1957; Morgan, 1894; Thorndike, 1911). Romanes, writing in 1892, describes the story of a cat opening a door:

Walking up to the door with a most matter-of-course kind of air, she used to spring at the half-hoop handle just below the thumb-latch. Holding on to the bottom of this half-hoop with one fore-paw, she then raised the other to the thumb-piece, and while depressing the latter, finally with her hind legs scratched and pushed the doorposts so as to open the door. (p. 421)

One explanation here is that the cat understood the door mechanism. Romanes offered an alternative, suggesting that the cat was imitating the actions of humans she had observed opening latches, translating them from hand to paw. These two explanations point to sophisticated capabilities: causal reasoning and social learning, respectively. Alternatively, the cat’s actions could result from trial-and-error learning, or from extensive training, as in the case of a video of a dog driving a car. Each explanation implies different capabilities and makes different predictions. If the cat understood door mechanisms, she could likely open similar but slightly different doors. If she were imitating humans, she would need some human exemplars to emulate. If trained or relying on trial-and-error, she would likely fail with any variation in door design. Furthermore, if she had learned to associate specific cues like door colour with actions, even changing the door’s colour could lead to failure. Indeed, inflexible, brittle behaviours like this are common in the animal kingdom: albatross parents fail to recognise their chicks when they are outside of the nest, and one toddler over-generalised box-shaped objects as hand-sanitizer dispensers in the wake of the COVID-19 pandemic. Behaviour can be explained in a myriad of different ways, appealing to anything from brittle, highly specific action patterns to sophisticated and general capabilities.

We can also explain the behaviour of AI systems in multiple ways, each leading to different conclusions about their capabilities, and different predictions about how they will behave in new contexts. In a much-discussed study, large language models were evaluated on their ability to reason about the beliefs and desires of others, known in cognitive science as Theory-of-Mind. In one experiment, the models were presented with a short story about a bag filled with popcorn, in which the bag is labelled with the word “chocolate”. A character in the story sees the bag and reads the label without looking inside. The model had to infer whether the character believed there was chocolate or popcorn in the bag. Of course, a human is able to reason quite robustly that the character must believe that there is chocolate in the bag. Several large language models also made this inference, inviting the conclusion that they could reason about the beliefs of others.

Figure 1: A theory of mind task in which the bag is transparent and full of popcorn, and yet labelled as ‘chocolate’. Large language models failed to infer that the person pictured would believe that the bag contains popcorn rather than chocolate. Reproduced from Ullman, Fig. 1A, under licence CC-4.0.

However, the history of cognitive psychology tells us that such a conclusion is premature. These models may have been trained on the experimental materials in the study, which are available online, leading to apparently sophisticated, but brittle, behaviours. Alternatively, the models may be sensitive to particular features of the prompt that are irrelevant to theory of mind per se, similar to Romanes’ cat learning about mechanism-irrelevant features of the door it learnt to open. Indeed, another study found that adding extra white space to the text prompt, a clearly irrelevant feature, could change the answers of the model. More interestingly, the same study found that these models failed when asked to reason about a similar event in which the bag is transparent, so that the character can see that it contains popcorn. The model nevertheless reasoned that the character would believe there was chocolate in the bag (see Fig. 1).
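
One way to make this kind of check concrete is to run the same question through several rewrites that differ only in irrelevant surface features, and to measure how often the answer changes. The Python sketch below is purely illustrative. The function names whitespace_variants and answer_consistency, the example story, and the two toy models are assumptions invented here; they are not code or materials from the studies discussed, and the toy models are simple stand-ins for a real language model.

```python
from typing import Callable, List


def whitespace_variants(prompt: str) -> List[str]:
    """Return copies of the prompt that differ only in white space."""
    return [
        prompt,                       # unchanged baseline
        " " + prompt,                 # leading space
        prompt + "\n",                # trailing newline
        prompt.replace(". ", ".  "),  # double space after full stops
    ]


def answer_consistency(model: Callable[[str], str], variants: List[str]) -> float:
    """Fraction of variants whose answer matches the answer to the first variant."""
    answers = [model(v) for v in variants]
    baseline = answers[0]
    return sum(a == baseline for a in answers) / len(answers)


def brittle_toy_model(prompt: str) -> str:
    # Hypothetical stand-in: the answer depends on a surface feature
    # (whether the prompt ends with a newline), not on the story itself.
    return "popcorn" if prompt.endswith("\n") else "chocolate"


def robust_toy_model(prompt: str) -> str:
    # Hypothetical stand-in: the answer ignores white space entirely.
    return "chocolate"


story = (
    "The bag is full of popcorn. The label on the bag says chocolate. "
    "Sam reads the label without looking inside. What does Sam think is in the bag?"
)
variants = whitespace_variants(story)

print(answer_consistency(brittle_toy_model, variants))  # 0.75
print(answer_consistency(robust_toy_model, variants))   # 1.0
```

A consistency score below 1.0 does not say which explanation of the behaviour is correct, but it does flag that the answer depends on something other than the content of the story.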

It is often tempting to prematurely ascribe sophisticated, human-like capabilities like Theory-of-Mind to other agents. We have a predisposition to explain behaviour in the intuitive manner we use every day to reason about other humans. However, tasks can be solved in a variety of different ways, especially when we are talking about systems that differ in their construction, training, and ecology. While this is true for crows, whales, and octopuses, nowhere is it clearer than in the case of AI. The problem is exacerbated by the apparent ability of contemporary AI to confirm our anthropomorphic intuitions, by dint of appearing so human-like in its interactions. What we need are carefully constructed experiments that allow us to overcome this anthropomorphic temptation and to robustly adjudicate between the capabilities and mechanisms driving behaviour in artificial intelligences.

Cognitive Evaluations of Artificial Intelligence

Cognitive psychology for AI is a nascent but fast-moving field, aiming to detect, measure, and compare the capabilities of artificial systems (Hagendorff, 2023; Hernández-Orallo, 2017; Taylor and Taylor, 2021). These systems include robotic agents that take actions in the world or in a simulated environment, as well as foundation models that receive text, audio, and images to produce multi-modal conversations with interlocutors.

The Kinds of Intelligence Programme at the CFI has been contributing to this endeavour with two of its research themes. First, we have been developing the Animal-AI Environment, a virtual laboratory for conducting psychological tests on artificial agents (Beyret et al., 2022; Crosby et al., 2019; Voudouris et al., 2023; see Fig. 2). We have constructed large libraries of cognitive tests for probing the capabilities of systems to perform common-sense reasoning tasks, such as tracking occluded objects, using tools, and counting (Voudouris et al., 2022). By drawing from the literature on non-human animals and pre-verbal children, these experiments do not rely on language, enabling us to conduct experiments on non-linguistic systems, like Deep Reinforcement Learning agents (Voudouris et al., 2022), similar to those used to play computer games at super-human level.

A selection of short videos, created by the KoI team, showing the virtual environment that the AI minds and child subjects navigate around

Figure 2: The Animal-AI Environment, a virtual laboratory to study capabilities in AI and humans.

In our second research theme, we have been drawing on psychometric theory and Bayesian statistics to develop methodologies for inferring and measuring the capabilities of AI systems based on their performance on well-defined tasks (Burden et al., 2023). As a concrete example, consider the case of a self-driving car that has been tested in a variety of conditions. Some roads contain more bends than others, while on some days there is more fog than on others. Given information about the performance of a self-driving car in a range of conditions, we can infer the capability level of that car both in terms of its ability to handle fog and its ability to handle bendier roads (Burnell et al., 2022). Our methodology improves both the explainability of system behaviour (i.e., why does the car pass or fail in a particular instance) as well as its predictability (i.e., how will the car perform in new conditions). By integrating careful experimental design with psychometric methodologies, we can detect and measure system capabilities and use those inferences to make nuanced predictions about future behaviour, with implications for how we oversee the deployment of AI systems.
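
To give a flavour of this kind of inference, the Python sketch below estimates two capability levels for a car from a small log of pass/fail results. It is a deliberately simplified illustration under stated assumptions, not the method of Burden et al. (2023) or Burnell et al. (2022): the 0 to 10 capability and demand scales, the logistic link in p_success, the slope value, the uniform prior over a grid, and the test log itself are all invented here for the example.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_success(cap_fog, cap_bend, fog, bends, slope=2.0):
    """Assumed link: the car passes only if it copes with both demands.
    Each margin (capability minus demand) goes through a logistic curve."""
    return sigmoid(slope * (cap_fog - fog)) * sigmoid(slope * (cap_bend - bends))


def posterior(results, grid):
    """Grid posterior over (cap_fog, cap_bend): uniform prior, Bernoulli likelihood."""
    log_post = {}
    for cap_fog, cap_bend in product(grid, grid):
        log_lik = 0.0
        for fog, bends, passed in results:
            p = p_success(cap_fog, cap_bend, fog, bends)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            log_lik += math.log(p if passed else 1.0 - p)
        log_post[(cap_fog, cap_bend)] = log_lik
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def posterior_means(post):
    """Posterior mean of each capability."""
    mean_fog = sum(cf * w for (cf, _), w in post.items())
    mean_bend = sum(cb * w for (_, cb), w in post.items())
    return mean_fog, mean_bend


def predict(post, fog, bends):
    """Posterior predictive probability of passing a new condition."""
    return sum(w * p_success(cf, cb, fog, bends) for (cf, cb), w in post.items())


# Hypothetical test log: (fog demand, bend demand, passed?)
results = [
    (1, 1, True), (2, 6, True), (3, 2, True), (4, 7, True), (4, 3, True),
    (6, 1, False), (7, 3, False), (5, 2, False),
    (2, 9, False), (1, 8, False), (3, 8, False),
]
grid = [i * 0.5 for i in range(21)]  # capability levels 0.0, 0.5, ..., 10.0

post = posterior(results, grid)
mean_fog, mean_bend = posterior_means(post)
print(f"estimated fog capability:  {mean_fog:.2f}")
print(f"estimated bend capability: {mean_bend:.2f}")
print(f"P(pass | fog=2, bends=3):  {predict(post, 2, 3):.2f}")
print(f"P(pass | fog=8, bends=3):  {predict(post, 8, 3):.2f}")
```

The two estimated capability levels serve both purposes described above: comparing them with the demands of a given instance suggests why the car passed or failed there, and feeding them into predict gives a probability of success for conditions the car has not yet been tested in.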


The paradox of AI systems, which exhibit both remarkable capabilities and striking failures, underscores the need for rigorous evaluation methods that go beyond surface-level assessments. Cognitive psychology offers a rich and established framework for evaluating the capabilities and limitations of contemporary AI systems. Just as cognitive psychologists have long studied the sophisticated behaviours of humans and non-human animals, these methods can be adapted to understand AI. As AI continues to evolve, so too must our methods for studying and understanding it. By developing novel research platforms, benchmarks, and testbeds, as well as psychometric techniques for nuanced measurement and prediction, we can ensure that future AI systems are reliable and safe before they are deployed.

Additional References:

E. L. Thorndike, Animal Intelligence: Experimental Studies. New York: Macmillan Company, 1911.

C. L. Morgan, Introduction to Comparative Psychology, 1st ed. London: Walter Scott Publishing Co, 1894.

W. Köhler, The Mentality of Apes. Harmondsworth, Middlesex: Penguin Books, 1957 (original work published 1925).

G. J. Romanes, Animal Intelligence. D. Appleton, 1892.

J. Hernández-Orallo, The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press, 2017.

M. Crosby, B. Beyret, M. Shanahan, J. Hernández-Orallo, L. Cheke, and M. Halina, ‘The animal-AI testbed and competition’, in NeurIPS 2019 competition and demonstration track, PMLR, 2020, pp. 164–176.