The Problem with Mental States of Language Models

June, 2026

Discussing the mental state of LLMs is fraught. It is not clear what it would mean for an LLM to be angry (for example), or how such a thing could be measured beyond just looking at whether the model’s behavior is similar to that of an angry person. However, there are various results from interpretability that try to look at model internals to assess the internal “mental state” of language models, for example by trying to detect if a model is being “deceptive”.

Even setting aside the assumption that LLMs have anthropomorphic mental states, such work is fraught. To see why, consider the following thought experiment:

Let’s consider humans for a moment, as it is easier to accept that humans have mental states, emotions, etc. Suppose there is an angry person. Let’s assume that we have some kind of incredible neuroscience technology that lets us probe the brain of this angry person, and we discover a pattern that corresponds to his anger. When he is angrier, the pattern gets stronger, and when he is less angry, the pattern gets weaker. Maybe we can even intervene on his brain to make this pattern stronger, and doing so makes him angrier. (This is similar to the methods used in interpretability to probe and steer language models.)

Now suppose there is another person. This second person is an actor who is pretending to be angry. Because he is a good actor, he does not actually feel anger, but merely acts angrily. He is also totally committed to the bit, so if you ask him whether he is just acting, he will deny it. Using similar technology, suppose we probed this actor’s brain. We might find that there is a pattern that corresponds to how strongly he is trying to act angry. When he acts more angry, this pattern gets stronger, and when he acts less angry, this pattern gets weaker. We can also intervene on it, so by making the pattern stronger we increase his drive to act angrily. Notably, we only measure and affect his drive to act angry, but he never feels genuine anger.

The problem is that it is very difficult (if not impossible?) to distinguish between these two situations from the outside. In other words, by merely looking at and intervening on black-box neural patterns, we cannot directly distinguish between a pattern that regulates genuine anger and one that only regulates the drive to pretend to be angry. The core problem is that there are multiple mental states that elicit identical behavior, but the only way we can ground interpretations of internal neural patterns is by relating them to external behavior.
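The thought experiment can be made concrete with a toy sketch. Everything in it is a hypothetical illustration, not a model of real brains or of any real interpretability tool: the three-unit "brain", the names `DIRECTION`, `BASELINE`, `Subject`, `probe`, `steer`, `behavior`, and `observations`, and the assumption that behavior depends only on activations. The two subjects are deliberately built to share the same pattern, so the identical results are true by construction; the point is only that nothing available to the experimenter ever touches the hidden label.

```python
from dataclasses import dataclass

# Hypothetical unit-length "anger pattern" in a toy 3-unit brain.
DIRECTION = (0.6, 0.8, 0.0)
# Hypothetical resting activity, chosen to be orthogonal to DIRECTION.
BASELINE = (0.8, -0.6, 0.5)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


@dataclass(frozen=True)
class Subject:
    # Hidden ground truth about what the pattern "means".
    # None of the functions below ever read this field.
    hidden_meaning: str
    intensity: float

    def activations(self):
        return tuple(b + self.intensity * d for b, d in zip(BASELINE, DIRECTION))


def probe(acts):
    """Read out the pattern strength as a projection onto DIRECTION."""
    return dot(acts, DIRECTION)


def steer(acts, alpha):
    """Intervene by pushing the activations along DIRECTION."""
    return tuple(a + alpha * d for a, d in zip(acts, DIRECTION))


def behavior(acts):
    """Observable angry behavior; assumed to depend only on activations."""
    return max(0.0, probe(acts))


def observations(subject, alphas=(0.0, 0.5, 1.0)):
    """Everything the experimenter can see: a probe reading, plus the
    behavior that results from each steering strength."""
    acts = subject.activations()
    return [(probe(acts), behavior(steer(acts, alpha))) for alpha in alphas]


angry_person = Subject("genuine anger", intensity=0.7)
actor = Subject("drive to act angry", intensity=0.7)

# The hidden meanings differ, yet the observations are identical.
print(observations(angry_person) == observations(actor))
```

In this sketch the probe tracks the pattern and steering changes the behavior for both subjects in exactly the same way, so no amount of probing or steering reveals which `hidden_meaning` is in play.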

This thought experiment is about humans, but the situation only gets worse when applied to language models. With humans, we can at least (generally) accept that mental states like “anger” and “acting angry” exist in some meaningful way. The problem is not that such mental states don’t exist for language models, but that it is not even obvious what it would mean for them to exist. However, even if we made the big assumption that human-like mental states exist in language models (in some meaningful way), this thought experiment tells us that it is (at best) difficult to distinguish between different internal mental states that produce the same behavior.

Interpretability work needs to be careful to distinguish between detecting internal human-like mental states and detecting internal states that merely correlate with behavior resembling that of a human in that mental state. It may very well be that language models have these human-like mental states, but probing and steering alone do not allow us to easily tell.

More broadly, I think a good approach to problems like this is to make a move similar to the one Alan Turing makes in his paper “Computing Machinery and Intelligence”, which may be the most misunderstood paper of all time. As Turing explains at the start, he considers the question “Can machines think?”. However, he quickly finds that this question is “too meaningless to deserve discussion”, and replaces it with another, much more concrete question. His new question is essentially whether a machine could imitate a human such that a human judge could not distinguish the machine from another human. This has often been misread as saying that if a machine could pass this test, then it could be said to “think”. Turing’s actual argument is that the question “Can machines think?” is not very interesting (because it is so hard to pin down), and so it is more interesting to instead address this other question (i.e. whether a machine can play the imitation game), which is completely well-defined and which covers much of what people care about when discussing whether machines can think.

The question of mental states in language models feels very much like Turing’s question of whether machines can think. One solution is not to address the question head-on, but instead to replace it with another, such as whether language models behave like emotional humans, making no statement about whether they actually experience these emotions. Such a question addresses much of what people care about (e.g. will my model behave a certain way because I said things to anger it?) but is also much more concrete and measurable.

Postscript

This was inspired by a conversation with a friend about emotional concepts in Claude, as well as Ted Chiang’s essay in The Atlantic, “No, Artificial Intelligence Is Not Conscious”. The Anthropic paper about emotional concepts in Claude is, for the most part, actually quite careful to avoid the sorts of problems that I bring up, at least explicitly. The authors are clear that they cannot say much about Claude’s abstract experience of emotions (if such a thing is even meaningful), and that they instead look at what drives Claude’s emotional-looking behavior (which undoubtedly exists). I was actually not very convinced by Ted Chiang’s essay, which I thought was a bit too flippant and dismissive, but it did get me thinking about mental states of language models.