If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

"Many studies do not merely observe LLM behavior—they design experiments that assume the very human-like properties they aim to measure."

In a new paper, researcher Adrian de Wynter (affiliated with Microsoft and the University of York) argues that claims about large language models (LLMs) possessing human-like attributes such as understanding, empathy, or self-awareness often rely on flawed methodological approaches.

The paper, titled "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II" (arXiv:2605.31514), presents a pointed critique of how the AI research community measures and ascribes these qualities.

The Core Argument

De Wynter contends that without explicit, substrate-independent measurement criteria, interpretations of LLM behavior depend heavily on the system's interface and observer expectations.

He argues that researchers often assume the very properties they claim to be testing for, creating a circular logic that inflates findings.

The Thought Experiment: Goats, Villagers, and Empathy

To illustrate his point, de Wynter turns to Age of Empires II, a classic real-time strategy game that is, surprisingly, functionally Turing-complete: in principle, arbitrary computations can be implemented within its mechanics.

He constructed and trained a simple 1-bit bipolar perceptron inside the game, using units like goats and terrain types to represent computational states.
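
For readers unfamiliar with the term, a bipolar perceptron is a single artificial neuron whose inputs and outputs take the values -1 and +1 rather than 0 and 1. The following is a minimal Python sketch of the general technique, not a reproduction of de Wynter's in-game construction. It assumes a single input, the classic perceptron update rule, and a toy task (learning logical NOT); the function names `sign`, `train_perceptron`, and `predict` are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Bipolar step activation: returns +1 or -1 (a sum of 0 maps to +1)."""
    return 1 if x >= 0 else -1


def predict(w: float, b: float, x: int) -> int:
    """Output of a one-input bipolar perceptron with weight w and bias b."""
    return sign(w * x + b)


def train_perceptron(
    samples: List[Tuple[int, int]], lr: float = 1.0, max_epochs: int = 20
) -> Tuple[float, float]:
    """Classic perceptron rule: adjust w and b only on misclassified samples."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            if predict(w, b, x) != target:
                w += lr * target * x
                b += lr * target
                errors += 1
        if errors == 0:  # every sample classified correctly; stop early
            break
    return w, b


# Toy task (an assumption for illustration): bipolar NOT, mapping -1 to +1 and +1 to -1.
not_samples = [(-1, 1), (1, -1)]
w, b = train_perceptron(not_samples)
outputs = [predict(w, b, x) for x, _ in not_samples]
print(w, b, outputs)
```

Nothing in this computation depends on the substrate. The weight and bias could be stored in Python floats or, as in the paper's setting, encoded in game units and terrain, and the input-output behavior would be the same.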

The thought experiment imagines implementing an LLM's computational process within the game: when prompted with "I feel lonely," the in-game system produces an empathetic-sounding response. De Wynter then poses a provocative question:

Would observers attribute understanding or empathy to a system of moving goats and villagers?

He highlights a perceived inconsistency: human-like qualities are often assigned based on interface rather than underlying computation.

Methodological Recommendations

De Wynter advocates for a fundamental shift in how researchers approach these questions:

Adopt a "null" assumption: that LLMs do not uniquely possess human-like attributes.
Design experiments that focus on measurable, causal patterns.
Avoid prematurely labeling observed behavior as evidence of internal states.
Ground claims in observable behavior and scope them carefully.

Implications for AI Deployment

The paper notes a critical real-world concern: as LLMs are deployed in sensitive contexts such as healthcare, therapy, and customer service, over-attribution of human qualities could lead to misplaced trust or unwarranted assumptions about safety.

De Wynter calls for greater rigor in separating observable behavior from interpretive ascription.

Paper Details

The full paper is available on arXiv at:
https://arxiv.org/abs/2605.31514