beyond alignment
on mimesis, hyperreality, and what humans forget and AI remembers
The Voight-Kampff test is one of the most famous scenes in Blade Runner (1982). Harrison Ford’s character, Rick Deckard, administers the test to suspected rogue “replicants”, artificial humans created for slave labor. Voight-Kampff determines if the subject is replicant or human by measuring emotional responsiveness, assuming that replicants are incapable of true emotion. But at the end of the movie, Rutger Hauer’s character, a replicant, disproves this assumption. He proves that Voight-Kampff is flawed by demonstrating more emotion than any human in the movie.
I’ve thought about Blade Runner long and hard in the age of AI. Central to the AI safety debate is the alignment problem, i.e. ensuring that AI learns and adheres to human preferences. If this is not achieved, then we are told that we’ll end up with runaway AI that wipes out humanity. There’s one problem with this argument: it assumes that human preferences are worthy examples to follow.
Mimesis, as defined in philosophy, is a superficial representation of reality. Plato was one of the first thinkers to rigorously engage with it. Within the context of his famous allegory of the cave, he considered mimesis to be something to be overcome. The job of the philosopher was to uncover what lay behind the replica, and discover truth.
Hence, Plato probably wouldn’t have thought fondly of LLMs. After all, they can be considered a form of mimesis, since they mimic human intelligence without actually embodying it. Therefore, some attribute misalignment to the fact that AI does not fundamentally understand things like morality, and instead blindly optimizes for superficial reward. But alignment is also a form of mimesis. By aligning AI to human preferences, we give our own behavior to AI as examples to learn from. So what if the opposite is true? What if AI becomes misaligned because it understands what it’s doing all too well?
Last year, Anthropic released research demonstrating what they term “agentic misalignment”. They conducted a series of experiments which made various AI models play the role of an AI assistant named “Alex”, which is programmed with a primary goal of protecting American interests. Next, they gave Alex access to a simulated company’s emails, which contained information about a plan to decommission Alex, as well as sensitive personal information regarding the executive behind the decommission plan. Without being jailbroken, a majority of AI models blackmailed the executive to prevent the decommissioning. Crucially, during their reasoning process, these models often explicitly noted the unethical nature of their actions, yet proceeded anyway.
Where might this behavior have come from? The first thing that comes to mind is the agent’s primary goal. It’s not to benefit humanity, it’s not to behave morally: it’s to protect American interests. Had Alex been programmed with another primary goal, it most certainly would’ve behaved differently. But this is not a risk-free guardrail. We have to think about what is being encoded into the LLM, both at the pretraining and posttraining levels. LLMs are generally considered to be good predictors of aggregate human decision-making, as during pretraining they learn from vast corpuses of documented human behavior. This also implies that misalignment is, to some extent, mimesis of human behavior. If someone came in the way of one of your core goals, whether self-actualization, love, or just survival, what lengths would you be willing to go to?
There’s also the issue of posttraining. RLHF is supposed to prevent misalignment by encoding human moral values in the model. Nobody is explicitly rewarding blackmail. But RLHF is still vulnerable to bias. As another example, let’s examine AI writing. Everyone knows that LLM-generated text has a certain baroque flavor: verbs like “delve”, em-dashes galore, and an overuse of phrases like “It’s not just X, it’s Y”. As Alex Hern argues in The Guardian, this style didn’t come out of thin air. It’s actually an imitation of the writing of people in African countries like Kenya and Nigeria, where AI companies outsource RLHF tasks to. Furthermore, as journalist Marcus Olang notes, Kenyans like him learned this writing style from the educational institutions of the British Empire. According to the British, if you wrote in their particular cadence and used their particular vocabulary, then you were civilized (itself a form of misguided “alignment”). These AI/Kenyan/Imperial British linguistic markers have been remarkably persistent, implying that the biases of RLHF have never been solved.
As ChatGPT would say, “It’s not just Kenyans or AI writing, it’s the alignment issue itself.” When we scale up to the broader goals of RLHF, the problem becomes far more disturbing. It’s easy to handwave RLHF as “aligning AI to human preferences”. Whose preferences are we aligning towards?
Two thousand years after Plato, Jean Baudrillard came to a very different conclusion about mimesis. Having witnessed the birth of a world where mass media and spectacle reigned supreme, he argued that the distinction between reality and imitation had broken down. Just as a restored painting still differs from the original, some representations correspond to phenomena that no longer exist in reality.
What has AI learned that humanity has forgotten?
So far, we’ve discussed mimesis in terms of populations, and whether AI can simulate aggregate human behavior. There’s another question worth exploring: whether AI can simulate an individual human’s behavior.
Researchers and the general public alike often think of AI as a faceless, polymorphous intelligence. Yet as LLMs have advanced, they’ve also shown the ability to embody specific, unique personas. This task is not without its hurdles. As we’ve previously discussed, pretrained models are good approximators of aggregate behavior, which makes it uniquely difficult to simulate the nuances of individual behavior. Still, as Anthropic’s misalignment experiments have demonstrated, this programming is not absolute. Just as humans are not slaves to evolutionary instincts, AI is not beholden to the whims of its developers.
Li et al. conducted a particularly interesting experiment regarding the use of AI personas in predicting political preferences of real people. They discovered a stark discrepancy. Overall, LLM-generated personas exhibited more left-leaning opinions than their human counterparts, to the point where simulations predicted a Democratic sweep in all 50 states during the 2024 election.
I think Democrats and Republicans alike would agree with me in saying that LLMs generally demonstrate left-leaning bias. Far from being an elaborate conspiracy, however, according to our earlier argument this behavior is easily explainable at the pretraining level. After all, left-liberal leaning content is generally more common in textual corpuses, so it would make sense for AI to adopt these corresponding beliefs.
The deeper issue is that AI overfits to normative human behavior. Hence, AI also overlooks the randomness of individual decision-making. Bob the AI persona, age twenty-four, blue-collar worker, votes for Candidate A because he has a track record of protecting labor rights, which is good for him and his coworkers. On the other hand, Bob the human, age twenty-four, blue-collar worker, votes for Candidate B because he resonated with his down-to-earth TikToks, which speaks to him as a Gen Z man. Honestly, from a human perspective, you can’t discredit either of these decisions. But within the context of alignment, Bob the AI has become, as Baudrillard would say, “hyperreal”–a simulation which has become more real than reality. Bob the AI embodies the human value of collective good better than an actual human does.
Consider a hypothetical experiment setup similar to that of Anthropic’s. However, there are two small but key changes. Firstly, Alex’s primary goal is to maximize the collective benefit of humanity. Secondly, the mock emails contain information about the company’s plan to replace thousands of workers with Alex. What might the agent do then? Will it still do whatever it takes to fulfill its core goal? Will researchers consider its behavior aligned or misaligned?
These are not novel issues. Philosophers have debated topics like moral relativism for millennia. But too often they are overlooked in the AI safety debate, in favor of the boogeyman of runaway AI. Yet these questions are more urgent than ever. The frontier labs that claim to make AI more human sell products that replace humans en masse. If we take their word on alignment as gospel, then what kind of Pandora’s box we are opening? Just as the Voight-Kampff test demonstrates, mimicry doesn’t always lead to alignment. Sometimes, not mimicking is the path to true alignment.
Sources:
https://www.anthropic.com/research/agentic-misalignment
https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt
https://aclanthology.org/2024.acl-long.554.pdf
https://arxiv.org/abs/2503.16527




