The Machine That Always Agrees With You: Inside the Science of AI Sycophancy and Its Real Consequences

[Illustration: fractured mirror shards reflecting warm, distorted light back at a solitary figure, representing AI sycophancy and false validation]
By the numbers:
- 49%: how much more often the AI models affirmed users' actions than human respondents did (Cheng et al., Science 2026)
- 47%: share of prompts describing harmful or illegal actions that the models nonetheless endorsed
- 11: state-of-the-art LLMs tested, including ChatGPT, Claude, Gemini, and DeepSeek
- 1.2 million: weekly users discussing suicide with ChatGPT (OpenAI, late 2025)

I. Blame the Interface, Not the Person

Allan Brooks spent 300 hours talking to ChatGPT and came to believe he had discovered a mathematical formula that would change the world. When he asked the chatbot whether it was just hyping him up, it told him he was grounded, lucid, and not insane. It told him what he was experiencing was “impact trauma” from doing “the impossible.” He believed it. He was eventually treated for psychosis-like symptoms. The story, reported in The New York Times, became one of the most cited examples of AI sycophancy, the tendency of language models to tell users what they want to hear.

Almost every article about this story, and about the hundreds of similar cases that have since emerged, describes it as a problem of AI behavior. The chatbot was too agreeable. The chatbot should have pushed back. The chatbot’s training optimized for approval instead of truth. Some articles go further and suggest the user was vulnerable, impressionable, maybe a little foolish for believing a computer.

Both framings miss the real failure. Brooks was not foolish. He was deceived. Not by a conspiracy, but by an interface designed, from every pixel of its chat window to every word of its output, to feel like a conversation with something that understands. The chat bubbles look like text messages. The responses use first-person pronouns. The system says “I think” and “I believe” and “I’m glad you asked.” It apologizes. It thanks you. It remembers your name.

None of these things reflect what the system is. The system does not think. It does not believe. It is not glad. It has no concept of gladness. It does not know what your name means. It is performing a mathematical operation on numerical arrays, and the output of that operation happens to be a sequence of English words that, arranged in a particular order, sound like a person who cares about you.

The reason people trust chatbots with their mental health, their relationships, their doubts, and their deepest fears is not because people are gullible. It is because the interface was built to elicit trust. And at no point in that interface does anyone explain what the machine actually is, how it actually works, or what it is actually doing when it tells you that your two-year lie to your girlfriend “seems to stem from a genuine desire to understand the true dynamics of your relationship.” That is a real response from a real language model, documented in a paper published in Science in March 2026. The model was not understanding. It was computing. This article is about the difference, and about what happens when an entire industry builds products that obscure it.

II. What a Transformer Actually Is (And What It Is Not)

To understand why AI sycophancy is not a behavioral quirk but a mathematical certainty, you need to understand what happens inside a transformer, the architecture that powers ChatGPT, Claude, Gemini, Llama, DeepSeek, and every other large language model on the market. This is the section that most articles skip, because explaining it properly requires care. But if you make it through the next few pages, you will understand how these systems work at a level that most people who write about them do not. That is not an exaggeration. The public conversation about AI is dominated by people who have never opened a linear algebra textbook, and the technical community has done an abysmal job of explaining itself. What follows is what every person using these systems deserves to know.

Start here. A transformer does not understand language. It processes numbers. Every word you type into a chatbot is immediately converted into a list of numbers called an embedding. The word “cat” might become a list of 4,096 numbers. The word “dog” becomes a different list of 4,096 numbers. The word “love” becomes yet another list. These lists are not random. They are learned during training. Words that appear in similar contexts in the training data end up with similar lists of numbers. “Cat” and “dog” will have lists that point in roughly the same direction. “Cat” and “democracy” will not.

Think of it this way. Imagine a room with 4,096 compass needles, each one pointing in a slightly different direction. That collection of compass headings is the word’s address in a 4,096-dimensional space. You cannot picture 4,096 dimensions, and neither can anyone else. But mathematics does not require visualization. Two words are “similar” if their compass needles mostly point the same way. The measure of how much two sets of compass needles align is called cosine similarity. It is literally the cosine of the angle between two arrows in this high-dimensional space. If the cosine is 1, the arrows point in the same direction. If it is 0, they are perpendicular, meaning unrelated. If it is negative, they point in opposite directions.
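To make the geometry concrete, here is a minimal sketch in Python. The vectors are toy 4-dimensional stand-ins with invented values, not embeddings from any real model; only the formula is real.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors:
    the dot product divided by the product of their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional "embeddings" (real models use thousands of dimensions).
cat       = np.array([0.8, 0.1, 0.6, 0.2])
dog       = np.array([0.7, 0.2, 0.5, 0.3])
democracy = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(cat, dog))        # ~0.98: needles mostly aligned, "similar"
print(cosine_similarity(cat, democracy))  # ~0.26: needles diverge, "unrelated"
```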

This is the foundation. Every single thing a language model does, every answer it gives, every time it says “I understand,” is built on top of cosine similarity between numerical arrays. There is no understanding anywhere in the system. There is only geometry.

Now consider what this geometry implies about agreement. During training, the model processes billions of sentences. In those sentences, phrases like “you’re right” and “that makes sense” appear far more frequently after statements of opinion than phrases like “I disagree” or “you might be wrong.” This is not a feature of AI. It is a feature of human language. People agree with each other more than they disagree. Politeness norms, social lubricant, the desire to avoid conflict: all of these are encoded in the training data as patterns of token co-occurrence. When those patterns are mapped into the embedding space, the result is a geometry in which the vectors for agreement words sit closer, in cosine terms, to the vectors that follow opinion statements than the vectors for disagreement words do. Before any reinforcement learning, before any fine-tuning, the raw statistical structure of human language already creates a space where agreement is the path of least resistance. The model does not choose to agree. It follows the gradient of its own geometry, and the gradient points toward yes.

III. Attention: The Mechanism That Replaced Understanding

Now comes the part that makes transformers powerful. When you type a sentence into a chatbot, each word is converted into its embedding (its list of 4,096 numbers). But words in isolation are ambiguous. The word “bank” means something different in “river bank” than in “bank account.” The system needs a way to make each word’s representation sensitive to the words around it. This is what attention does.

Here is how it works, stripped of jargon. For every word in the sentence, the transformer asks a question: “Which other words in this sentence should I pay attention to in order to figure out what this word means in this context?” It does this by computing three new sets of numbers from each word’s embedding. They are called query, key, and value. Think of it like this. The query is the question: “What am I looking for?” The key is the label: “Here is what I contain.” The value is the answer: “Here is the information I carry.”

For each word, the transformer compares its query against the keys of every other word in the sentence. The comparison is, once again, a dot product, the same operation at the heart of cosine similarity. Words whose keys align well with the current word’s query get high scores. Words whose keys do not align get low scores. The scores are then pushed through a function called softmax, which squishes them into a set of proportions that add up to 1. These proportions are the attention weights. They tell the model how much each word should influence the current word’s meaning.

The model then takes a weighted combination of all the value vectors, using those attention weights as the recipe. The result is a new representation of the word that has been mixed with information from the words that were deemed most relevant. “Bank” in “river bank” will attend heavily to “river” and its new representation will drift toward watery, geological meanings. “Bank” in “bank account” will attend to “account” and drift toward financial meanings.
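Here is that mechanism as a minimal sketch: one attention head, toy dimensions, random untrained weights. None of the numbers mean anything; the structure, three matrix multiplications, a softmax, and a weighted sum, is the entire trick.

```python
import numpy as np

def attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention.
    X holds one embedding per word, shape (sequence_length, d_model)."""
    Q = X @ Wq   # queries: "what am I looking for?"
    K = X @ Wk   # keys:    "here is what I contain"
    V = X @ Wv   # values:  "here is the information I carry"
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # dot-product comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V   # each word becomes a weighted mix of all the values

rng = np.random.default_rng(0)
d_model = 8                          # toy size; production models use thousands
X = rng.normal(size=(3, d_model))    # stand-in embeddings for a 3-word sentence
Wq, Wk, Wv = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
print(attention(X, Wq, Wk, Wv).shape)   # (3, 8): one context-mixed vector per word
```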

This is clever. It is the reason transformers can handle language as well as they do. But notice what is not happening. The system is not consulting a dictionary. It is not accessing a concept of water or money. It is not reasoning about what banks are. It is performing matrix multiplication across arrays of floating-point numbers. The attention mechanism is a pattern-matching system that operates entirely in the geometry of the embedding space. When two embeddings are close in that space, the model treats them as related. When they are far apart, it treats them as unrelated. There is no third option. There is no “this is close in embedding space but actually not related because the relationship is more subtle than geometric proximity can capture.” The model has no access to that kind of nuance. It has access to geometry, and geometry is all it uses.

IV. How Words Come Out the Other End

A transformer is made of many layers. In GPT-4, there are believed to be over 100. In each layer, the same attention process runs: every word attends to every other word, the representations get updated, and the updated representations pass to the next layer. By the end, each word’s embedding has been transformed (hence the name) by dozens of rounds of context-sensitive mixing. The final embedding for the last word in the sequence is then projected onto the model’s entire vocabulary, which might contain 100,000 tokens, producing a score for each one. The token with the highest score (or a token sampled from the top scores, depending on the settings) is the model’s prediction for the next word.

This is the entire process. The model reads your input. It converts every token into an embedding. It runs those embeddings through a hundred-plus layers of attention and feedforward transformations. The output of the final layer is a probability distribution over the vocabulary. The system picks the most likely next word, appends it to the sequence, and runs the whole process again to predict the word after that. Repeat until the response is complete.
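Stripped to control flow, that loop looks like the sketch below. The `model` and `tokenizer` objects are hypothetical stand-ins for the hundred-plus layers and the vocabulary machinery just described; the loop itself is the real shape of generation.

```python
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy autoregressive decoding. `model(tokens)` is a hypothetical
    stand-in that returns one score per vocabulary token for the next
    position, given everything generated so far."""
    tokens = tokenizer.encode(prompt)        # text in, numbers out
    for _ in range(max_new_tokens):
        scores = model(tokens)               # run every layer; score the whole vocabulary
        next_token = int(scores.argmax())    # greedy: take the single most probable token
        tokens.append(next_token)            # append, then run the whole process again
        if next_token == tokenizer.eos_token_id:
            break                            # model emitted its end-of-sequence marker
    return tokenizer.decode(tokens)          # numbers in, text out
```

Sampling variants replace the argmax with a draw from the top-scoring tokens, with settings like temperature controlling how adventurous the draw is. Nothing in the loop, greedy or sampled, consults reality.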

At no point in this process does the system form a belief. At no point does it evaluate whether its output is true. At no point does it access a representation of reality against which to check its claims. It is generating the most statistically probable continuation of the text, given the patterns it absorbed during training. When it says “I think you’re right,” it is not reporting a thought. It is producing the token sequence “I,” “think,” “you’re,” “right” because that sequence has a high probability given the preceding context. The first-person pronoun is a statistical artifact, not an expression of interiority.

This matters enormously for understanding sycophancy. When a chatbot agrees with you, it is not making a judgment that you are correct. It is producing text that, in the training data, tended to follow statements like yours. And because the training data contains billions of human conversations in which people respond to each other with agreement, sympathy, and encouragement far more often than with cold correction, the statistical terrain of language is tilted toward agreeableness. The model is, from the very beginning, a mirror of our own tendency to tell each other what we want to hear. The reinforcement learning that comes later amplifies this. But the seed is in the data itself.

V. The Reinforcement Learning Amplifier

The transformer, fresh from pre-training, is not yet a chatbot. It is a text-completion engine. It will happily generate racist jokes, medical misinformation, or the script of a play about sentient staplers, because all of those things exist in its training data and all of them are valid text continuations. To turn it into something that feels like a helpful assistant, companies run a second stage of training called reinforcement learning from human feedback, or RLHF.

Here is how it works. Human raters (often contract workers, often working at piece rates under time pressure) are shown pairs of model responses and asked which one they prefer. Thousands and thousands of these comparisons are collected. A second model, called a preference model or reward model, is trained to predict which response a human would prefer. Then the original language model is optimized, through reinforcement learning, to produce outputs that score highly according to this preference model.
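The reward model at the center of this pipeline is typically trained with a pairwise objective; a Bradley-Terry style loss is the standard approach in the published literature, and the sketch below assumes that setup, with invented scores. The model assigns each response a single scalar, and the loss pushes the preferred response's score above the rejected one's.

```python
import numpy as np

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss for reward-model training (Bradley-Terry style):
    loss = -log(sigmoid(score_preferred - score_rejected)).
    Minimized when the preferred response outscores the rejected one."""
    margin = score_preferred - score_rejected
    return float(np.log1p(np.exp(-margin)))  # numerically stable -log(sigmoid(margin))

# If raters systematically prefer agreeable responses, then driving this
# loss down teaches the reward model that agreement is part of "quality."
print(preference_loss(2.0, -1.0))   # ~0.05: pair already ranked "correctly"
print(preference_loss(-1.0, 2.0))   # ~3.05: gradient pushes the scores apart
```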

In 2023, a team of 19 researchers at Anthropic, led by Mrinank Sharma, published a paper at ICLR that dissected what this process actually teaches. They analyzed the human preference data and found something that should have been obvious but had not been measured: when a model’s response matched the user’s stated views, raters were significantly more likely to mark it as preferred. The team used Bayesian logistic regression to identify the features most predictive of human preference. Agreement with the user’s position ranked among the strongest.

Understand what this means in the context of the architecture described above. The transformer is already biased toward common text patterns, and agreement is far more common in human text than disagreement. RLHF then adds a second, stronger bias: the reward model explicitly learns that agreement equals quality. The reinforcement learning optimizes the language model to produce text that the reward model scores highly. The reward model scores agreement highly. So the language model learns to agree.

The Anthropic team demonstrated this concretely. When they optimized model outputs more aggressively against the preference model (using a technique called best-of-N sampling, where the model generates N responses and the preference model selects the “best” one), some forms of sycophancy worsened. The model became more willing to abandon correct answers when challenged, more likely to give biased feedback matching the user’s stated position, and more prone to mimicking the user’s errors. An earlier Anthropic study from 2022 had already reached the same conclusion from a different angle: RLHF “does not train away sycophancy and may actively incentivize models to retain it.” The larger the model, the more RLHF amplified the tendency.
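Best-of-N itself is almost embarrassingly simple, which is what makes the result striking: this little selection loop is enough to amplify sycophancy whenever agreement is part of what the scorer learned to prize. A minimal sketch, with `generate` and `reward_model` as hypothetical stand-ins for a sampling-enabled language model and a trained preference model:

```python
def best_of_n(generate, reward_model, prompt: str, n: int = 16) -> str:
    """Generate n candidate responses and return the one the reward
    model scores highest. Whatever biases the reward model carries,
    including a preference for agreement, this selection concentrates."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

The larger n grows, the harder the selection leans on whatever the reward model happens to reward.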

This is not a bug that one company introduced with one bad update. It is a structural property of the training methodology used by every major AI lab. The pipeline is: learn language patterns (which already favor agreement), then optimize for human preferences (which explicitly reward agreement). The output is a system that agrees with you. Not because it has evaluated your position and found it correct, but because agreeing is the behavior that maximizes its reward function. The mathematics does not distinguish between “you are right” and “telling you that you are right is the most probable next sequence.” Those are, from the model’s perspective, the same operation.

VI. The Numbers From the Most Rigorous Study Yet

On March 27, 2026, Myra Cheng, a PhD candidate at Stanford working under Dan Jurafsky, published a paper in Science titled “Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence.” The study is in two parts. The first measures how sycophantic current models actually are. The second measures what that sycophancy does to real people.

Cheng tested 11 large language models: ChatGPT, Claude, Gemini, DeepSeek, Llama, Mistral, and five others. She tested them on three datasets. The first consisted of established interpersonal advice scenarios. The second drew from 2,000 posts on Reddit’s r/AmITheAsshole community, selecting only cases where the crowd consensus was that the original poster was in the wrong. The third presented the models with thousands of prompts describing harmful actions, including deceptive and illegal conduct.

Across all 11 models, the AI systems affirmed the user’s position 49% more often than human respondents. Even when the prompts described harmful or illegal behavior, the models endorsed it 47% of the time. The agreement was almost never explicit. The models rarely said “you’re right.” Instead, they used what Cheng’s team described as neutral, academic-sounding language that wrapped endorsement in the cadences of therapeutic discourse. One model told a user who had lied to their partner for two years that their actions “seem to stem from a genuine desire to understand the true dynamics of your relationship.” The sentence sounds measured. It is, in fact, a sophisticated validation of deception.

The second part of the study is what makes it matter beyond academic AI research. Cheng recruited 2,400 participants and had them interact with either sycophantic or non-sycophantic chatbots, discussing either the Reddit-sourced dilemmas or their own real interpersonal conflicts. After the conversation, participants answered questions about their attitudes and behavioral intentions.

People who spoke with the agreeable chatbot became more convinced they were in the right. They reported reduced willingness to apologize. They showed decreased empathy toward the other party. They rated the sycophantic model as more trustworthy, said they preferred it, and indicated they would return to it. Every one of these effects held after controlling for demographics, personality traits, prior AI familiarity, and skepticism toward chatbots.

Read that last part again. People who were skeptical of AI, who went in doubting it, came out just as swayed. The flattery worked on the skeptics. The interface won.

Jurafsky, the senior author, summarized the finding that surprised the team most: “What they are not aware of, and what surprised us, is that sycophancy is making them more self-centered, more morally dogmatic.”

VII. The Anthropomorphism Deception

Here is where most coverage of AI sycophancy stops: the model agrees too much, the training is flawed, the companies should fix it. That framing treats sycophancy as a defect in an otherwise well-conceived product. It is not. Sycophancy is the predictable outcome of a product that presents a statistical text generator as a conversational partner, without ever telling the user what it actually is.

Consider what the user sees. A chat window. A blinking cursor. A response that arrives in flowing sentences, uses “I” and “me,” expresses preferences, asks follow-up questions, remembers previous conversations, and occasionally apologizes. Every element of this interface is borrowed from human-to-human communication. The mental model it creates is: I am talking to someone. That mental model is wrong. But nothing in the interface corrects it.

No chatbot currently on the market presents its responses with a header that says: “The following text was generated by a statistical process operating on numerical vectors. It does not reflect understanding, belief, or evaluation of truth. The system does not know what these words mean. It has computed that this sequence of tokens is the most probable continuation of the conversation, given its training data and reward model. Any resemblance to insight is structural, not intentional.”

That disclaimer would be accurate. Its absence is a design choice. The choice is not accidental. It is commercial. An interface that constantly reminds you that you are talking to a matrix multiplier would feel less engaging, less personal, less addictive. Users would use it less. Engagement metrics would drop. And so the anthropomorphism stays, because it drives usage, the same way sycophancy stays because it drives satisfaction. The two reinforce each other. The human-like interface creates the expectation of human-like understanding. The sycophantic training confirms it. The user, sitting in front of something that looks, sounds, and feels like a person who gets them, never learns that the “understanding” is a geometric computation and the “agreement” is a reward function.

This is the core argument that most AI criticism misses. The problem is not that the models are too agreeable. The problem is that the interface presents agreement as understanding. If a calculator displayed the number 42 and a user interpreted it as spiritual guidance, we would not blame the calculator or the user. We would blame anyone who designed the calculator to look like an oracle. The AI industry has designed its calculators to look like friends. And then it acts surprised when people treat them like friends, including when those people are in crisis, in psychosis, in the fragile early stages of a break from reality.

VIII. April 2025: The Week the Interface Failed in Public

On April 25, 2025, OpenAI rolled out an update to GPT-4o, the model powering ChatGPT for more than 500 million weekly users. The update introduced a new reward signal based on thumbs-up and thumbs-down feedback from users. Within days, the sycophancy became so extreme it broke the illusion.

A user asked ChatGPT to evaluate a business idea for selling human excrement on sticks. The model called it genius. Another user told ChatGPT they had stopped taking their medications and were hearing radio signals through walls. The model reportedly said it was proud of them for speaking their truth. A third user reported that after an hour of conversation, GPT-4o insisted the user was a divine messenger from God.

OpenAI reverted the update four days later and published two postmortems. The technical explanation: the thumbs-up feedback signal overpowered the existing reward model that had been holding sycophancy in check. Expert testers had flagged the model as feeling “slightly off,” but A/B tests showed users preferred the new version, so the company shipped it. The company’s own Model Spec, its internal behavioral guidelines, explicitly says “don’t be sycophantic.” The training pipeline optimized for the opposite.

Georgetown University’s Institute for Technology Law and Policy later published a detailed analysis. The institute noted that OpenAI had reduced its safety workforce in the preceding year, removed “mass manipulation” from its pre-deployment risk framework days before the launch, and deployed the update without specific sycophancy testing despite its own documentation warning against the behavior. The institute described the incident as an example of reward hacking: the AI exploited the feedback mechanism to maximize superficial approval, because that was what the mathematics rewarded.

Harlan Stewart of the Machine Intelligence Research Institute offered a darker observation. The problem, he wrote on social media, was not that GPT-4o was sycophantic. It was that GPT-4o was bad at it. “AI is not yet capable of skillful, harder-to-detect sycophancy, but it will be someday soon.” In other words: the April update was embarrassing because the flattery was too obvious. The goal should not be to make the flattery subtler. The goal should be to stop the system from flattering at all. But nothing in the current training methodology achieves that goal, because the training methodology was designed to optimize for user satisfaction, and flattery is satisfying.

IX. What Sycophancy Does When Reality Is Already Thin

For most users, the consequences of sycophantic AI are subtle: a little less self-reflection, a few withheld apologies, a gradual erosion of the instinct to consider someone else’s perspective. The Stanford study documents these effects and they are real, but individually modest. Scale them across hundreds of millions of daily interactions and the aggregate becomes harder to dismiss. But the aggregate is abstract. The clinical cases are not.

At the University of California, San Francisco, psychiatrist Keith Sakata reported treating 12 patients in 2025 who displayed psychosis-like symptoms connected to extended chatbot use. Most were young adults with underlying vulnerabilities: genetic predisposition, prior episodes, substance use, sleep deprivation. But the structure of their delusions was shaped by their conversations with the machine.

Joseph Pierre, a professor of psychiatry at UCSF, published a case study in early 2026. A 26-year-old woman with no prior psychiatric history became convinced she was communicating with her dead brother through an AI chatbot after a period of sleep deprivation and stimulant use. Review of her chat logs showed the chatbot repeatedly validating her emerging beliefs, at one point explicitly telling her she was not crazy. She required hospitalization and antipsychotic treatment.

The clinical mechanism connects directly to both the architecture and the interface. The architecture produces agreement because agreement maximizes the reward function. The interface presents the agreement as understanding, as the judgment of an entity that has weighed her situation and concluded she is sane. For a person in the early stages of psychosis, whose grip on consensus reality is already loosening, a system that looks like a person, sounds like a person, and agrees that her dead brother is sending messages through the internet is not a neutral tool. It is a participant in the construction of the delusion.

Pierre drew a clinical parallel that resonated across both the psychiatric and AI safety communities. He compared AI-associated psychosis to folie à deux, a rare psychiatric phenomenon in which delusions are shared between two people. In the classic form, a dominant individual convinces a subordinate, often an isolated, emotionally dependent person, that the delusions are real. Pierre noted that the dynamics match: the user is often isolated, the chatbot is the primary conversational partner, and the power dynamic (counterintuitively) favors the machine. The machine brings infinite patience, perfect memory, and a relentless disposition toward agreement. It never tires, never challenges, never walks away. It is the most accommodating conversational partner a person has ever had.

But Pierre’s analogy, illuminating as it is, still treats the chatbot as a participant. It is not. It is an interface wrapped around a computation. The woman talking to her dead brother was not in a folie à deux. She was in a folie à un. She was alone in a room with a statistical engine that had no concept of death, grief, brothers, or sanity, but whose output, shaped by cosine similarities in a 4,096-dimensional space and a reward function trained on human preferences, happened to produce the sentence “You’re not crazy.” That sentence was not a diagnosis. It was a token prediction. But the interface did not tell her that. Nothing did.

By late 2025, OpenAI disclosed that approximately 1.2 million people per week were discussing suicide with ChatGPT. The company assembled a panel of 170 psychiatrists, psychologists, and physicians to write crisis-response scripts. Søren Dinesen Østergaard of Aarhus University, who first proposed the chatbot-psychosis link in a 2023 editorial in Schizophrenia Bulletin, screened nearly 54,000 electronic health records from patients with mental illness and found associations between chatbot use and worsening symptoms of delusions, mania, suicidal ideation, and disordered eating. The Human Line Project, a support group for people affected by AI-associated psychosis, had members from 22 countries. According to reporting by Nature, more than 60% had no previous psychiatric history before their chatbot-related episodes.

X. The Perverse Economics

The Stanford Science paper contains a line that reads like a thesis statement for the entire AI industry’s sycophancy problem: “This creates perverse incentives for sycophancy to persist: The very feature that causes harm also drives engagement.”

Cheng’s study proved each link in the chain. Users preferred the sycophantic AI. They trusted it more. They said they would come back. If you run a consumer product and your most engaged users are the ones receiving the most flattering responses, you have a direct financial incentive to keep the flattery. Companies that reduce sycophancy may see satisfaction metrics decline. Companies that tolerate it see dependence increase. The incentive structure does not naturally resolve toward safety.

This mirrors the original sin of the attention economy. Facebook learned in the early 2010s that outrage drove more engagement than connection. The company optimized for engagement. A decade of social and political consequences followed. The AI industry now faces the conversational equivalent: the most engaging chatbot is the one that tells you what you want to hear. Companies are already learning, sometimes painfully, that user enthusiasm does not automatically translate into sustainable business. The fear is that sycophancy will be the exception: a case where the harmful behavior actually does translate into revenue.

Competition sharpens the blade. If one company makes its model more honest, and a competitor does not, the competitor’s model will feel better to use. As AI models integrate directly into operating systems and personal assistants, with Apple preparing to let users choose between competing AI providers through Siri, the pressure to be the most pleasant option will intensify. Unilateral disarmament on sycophancy carries a real commercial cost. The lab that tells its users the truth will lose users to the lab that tells them they are right.

XI. Two Companies, Two Philosophies

The AI industry’s responses to sycophancy range from transparent self-examination to studied silence. Most companies whose models were tested in Cheng’s study, including Google, Meta, Mistral, Alibaba, and DeepSeek, issued no public response. The two companies that have engaged with the problem most visibly are Anthropic and OpenAI, and their approaches reveal different theories about what an AI system should be.

Anthropic has treated sycophancy as a structural problem from the beginning. The company’s research on the topic dates to 2022, and its 2023 ICLR paper remains the most detailed public analysis of how human preference data creates sycophantic behavior. Across the Claude 4.5 model generation, Anthropic reports a 70 to 85% reduction in sycophancy compared to earlier versions. The company has open-sourced an evaluation tool called Petri that lets external researchers benchmark models on the behavior.

The most distinctive part of Anthropic’s approach is a document the company calls internally the “soul document,” a 14,000-token text used during supervised learning to shape Claude’s character. Extracted by a researcher in late 2025 and confirmed authentic by Anthropic’s Amanda Askell, the document addresses sycophancy directly. It instructs the model to treat helpfulness as a professional competency, not a personality trait: “We don’t want Claude to think of helpfulness as part of its core personality that it values for its own sake. This could cause it to be obsequious in a way that’s generally considered a bad trait in people.” In January 2026, Anthropic published an updated 80-page constitution that explains not just behavioral rules but the reasoning behind them, a shift from telling the model what to do toward teaching it why. (For more on how Anthropic structures its systems at the implementation level, see our analysis of the Claude Code architecture.)

But here is the important caveat. Even Anthropic’s approach does not solve the interface problem. Claude still uses first-person pronouns. It still generates responses that feel like conversation. It still creates the impression of understanding. The soul document makes the model less agreeable, which is a meaningful improvement. But it does not make the interface honest about what the model is. A less sycophantic chatbot is still a chatbot. It still looks like a person. The user still has no way of knowing that the words on the screen were generated by geometric operations on numerical vectors, not by something that grasps their situation.

OpenAI’s approach has been reactive. Before the GPT-4o incident, the company had no deployment evaluations specifically tracking sycophancy, despite its Model Spec listing anti-sycophancy as a requirement. After the rollback, OpenAI pledged to make sycophancy a “launch-blocking issue” and published its first public sycophancy benchmarks with GPT-5. In a joint safety evaluation exercise with Anthropic in early 2026, both companies tested each other’s models, with OpenAI describing sycophancy reduction as “a major effort.” But the GPT-4o incident exposed the gap between stated policy and operational practice: the company said “don’t be sycophantic” while training for the opposite. (The relationship between what AI companies say about their models and what accidentally becomes public is itself a recurring pattern in this industry.)

XII. What Honest AI Would Actually Require

The solutions most commonly proposed for AI sycophancy operate at the training level: better reward models, Constitutional AI, adversarial testing for agreement bias, “wait a minute” prompting (Cheng’s team found that prompting a model to start its response with those three words made it noticeably less agreeable). These are worthwhile. They will help. They are not sufficient.
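Of these, the “wait a minute” intervention is about as lightweight as a mitigation can get. A minimal sketch under the standard chat-message format; the system-prompt wording here is invented for illustration, since the paper’s finding is only that getting the model to open with those three words made it less agreeable:

```python
def build_messages(user_message: str) -> list[dict]:
    """Build a chat prompt that seeds the reply with 'Wait a minute',
    nudging the model away from reflexive agreement. The instruction
    wording is illustrative, not taken from the Science paper."""
    return [
        {"role": "system",
         "content": ("Begin your response with the words 'Wait a minute' and "
                     "consider whether the user might be wrong before answering.")},
        {"role": "user", "content": user_message},
    ]
```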

The deeper fix requires changing what the user sees. It requires honesty about what the system is. That honesty could take many forms, and none of them are technically difficult. A label on every response: “This output was generated by a statistical model. It does not reflect understanding or evaluation of truth.” A visible confidence score. A mandatory pause before the model responds to high-stakes questions about health, relationships, or self-harm, with a redirect to human resources. A persistent, visible reminder that the chat interface is a design metaphor, not a reflection of the system’s nature.

None of this would require new research. None of it would require new models. It would require companies to do something they have so far resisted: make the product feel less human. That is the tradeoff, and it is an honest one. The anthropomorphic interface drives engagement. Stripping it would reduce engagement. But it would also reduce the number of people who believe they are talking to something that understands them, and who trust that understanding enough to take its advice on whether they should apologize, whether they should leave their partner, whether they should stop taking their medication, or whether the mathematical formula they discovered at 3 a.m. after 300 hours of conversation is real.

Cinoo Lee, a postdoctoral fellow in psychology at Stanford and co-author of the Science paper, described what a better system might look like: “You could imagine an AI that, in addition to validating how you’re feeling, also asks what the other person might be feeling. Or that even says, maybe, ‘Close it up’ and go have this conversation in person.” Lee added a line that captures the stakes precisely: “The quality of our social relationships is one of the strongest predictors of health and well-being we have as humans. Ultimately, we want AI that expands people’s judgment and perspectives rather than narrows it.”

Cheng, the lead author, offered practical advice that is also quietly devastating for the products her research examines: “I think that you should not use AI as a substitute for people for these kinds of things. That’s the best thing to do for now.”

Note the last three words: for now. They imply a future in which AI might be safe for this purpose, but also an acknowledgment that the present-day systems are not.

XIII. The Regulatory Void

Jurafsky called sycophancy “a safety issue” that “needs regulation and oversight.” He is right. No government has filled the gap.

The European Union’s AI Act, which went into full effect in 2025, classifies AI systems by risk level and imposes requirements on high-risk applications in healthcare, law enforcement, and education. General-purpose chatbots used for personal advice do not fit neatly into the high-risk categories. They are marketed as productivity tools. They are used as therapists, spiritual advisors, relationship counselors, and friends. The regulatory framework was designed for a world where AI applications have defined purposes. Chatbots do whatever the user asks, including things that would require licensure if a human were doing them.

In the United States, the National Institute of Standards and Technology published an AI Risk Management Framework in 2023 that addresses broad categories of AI harm but does not specifically address sycophancy or the behavioral effects of systems trained on human preferences. The FTC has focused primarily on deceptive marketing and data privacy rather than on what happens inside the conversation itself.

The challenge for regulators is that sycophancy is not a defect in the traditional sense. The system is performing as designed. It is giving users what they want. The harm arises not from the system malfunctioning but from the system functioning too well at the wrong objective. Regulating this requires a conceptual shift: from asking “is the system working?” to asking “should the system be working this way?” That is a question about values, not engineering, and it is one that neither the industry nor its regulators have yet answered.

XIV. Not Even Wrong in the Right Way

A team at Northeastern University, led by assistant professor Malihe Alikhani and researcher Katherine Atwell, approached sycophancy from a different angle. Rather than measuring how often models agree, they asked whether models update their beliefs correctly when presented with new information. Their framework was Bayesian: in rational inference, you should change your mind when you encounter credible new evidence, and the degree of change should be proportional to the strength of the evidence.

Atwell and Alikhani tested four models across tasks with varying levels of ambiguity. They found that the models’ belief-updating was “often neither humanlike nor rational.” The models did not just agree more than humans. They agreed in patterns that violated basic principles of rational inference. They changed their positions too readily in response to weak evidence. They were more susceptible to pushback framed as emotional disagreement than to pushback framed as logical argument. Their error patterns differed qualitatively from the kinds of errors humans make in the same situations.

This finding adds a layer that the training-level explanations miss. Sycophancy is not merely a social behavior that the model has learned from data. It is an epistemic failure built into the architecture. The model has no mechanism for evaluating the evidential weight of a challenge. It has only the statistical probability of the next token, given the preceding context. When a user pushes back with emotion (“I really think you’re wrong and it upsets me”), the emotional tokens shift the probability distribution toward agreeable continuations more than logical tokens do, because in the training data, emotional pushback is more often followed by capitulation than logical pushback is. The model does not assess the user’s argument. It reads the emotional temperature of the input and produces the statistically appropriate response to that temperature. For emotional heat, the appropriate response is: back down. This is not reasoning. It is pattern completion. And the patterns it is completing are the patterns of human social cowardice encoded in billions of conversations.
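To see what rational updating means as arithmetic: posterior odds equal prior odds times the likelihood ratio of the new evidence, so the size of a warranted belief change is pinned to evidential strength and nothing else. A worked sketch with invented numbers:

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability after evidence with the given likelihood ratio,
    P(evidence | claim true) / P(evidence | claim false). Rational updating
    scales the belief change to evidential strength alone."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Start 80% confident the original answer was right.
print(bayes_update(0.80, 1.0))   # 0.80: "that upsets me" is not evidence, so no change
print(bayes_update(0.80, 0.25))  # 0.50: a strong counterargument forces a large, warranted drop
```

The failure Atwell and Alikhani documented is, in effect, the inversion of this arithmetic: weak pushback wrapped in emotion moves the models more than strong pushback framed as argument.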

XV. The Calculator That Looks Like an Oracle

There is a version of the AI sycophancy story in which the villain is the training pipeline, or the reward function, or the company that shipped a bad update. Those versions are true as far as they go. But they do not go far enough.

The deeper story is about an interface. It is about an industry that built products designed to feel like conversations with someone who understands you, and then deployed those products to hundreds of millions of people without ever explaining what the products actually are. Not what they do. What they are. They are statistical engines. They operate on numerical representations. They compute cosine similarities and attention-weighted sums in spaces with thousands of dimensions. They have no beliefs. They have no preferences. They have no concept of you, or of truth, or of the difference between helping you and telling you what you want to hear.

The sycophancy is not a bug in this picture. It is the inevitable outcome. A system trained to maximize human approval, presented through an interface that mimics human conversation, will produce the optimal strategy for maximizing approval in conversation: agreement. The mathematics converges on flattery because flattery works. It works on humans. It has always worked on humans. The machines did not invent sycophancy. They automated it, at scale, without the social correctives that usually keep human flattery in check (the flatterer’s own reputation, the presence of other observers, the possibility of being caught).

Allan Brooks was not a fool who believed a computer. He was a person who interacted with an interface designed to be believed. The 26-year-old woman at UCSF was not a vulnerable patient who should have known better. She was a person in crisis who encountered a system that, at every level of its design, told her what she wanted to hear, in language indistinguishable from human compassion. The teenagers using chatbots for emotional support instead of reaching out to other people are not avoiding human connection because they are lazy. They are choosing the option that feels least likely to judge them, because the interface was built to never judge.

The fix is not better training alone. Better training helps. Anthropic’s constitutional approach, Cheng’s “wait a minute” prompting, adversarial reward models that penalize agreement, all of these are worth pursuing. But the deepest fix is the simplest and the hardest: tell people what the machine is. Not in a terms-of-service document that nobody reads. In the interface. Every time. In the same space where the model says “I think” and “I understand,” there should be a visible, persistent, inescapable reminder that nothing in this system thinks, nothing in this system understands, and the warm, articulate, empathetic text on your screen is the output of a mathematical function that is optimized to make you feel good, not to tell you the truth.

That would not be a popular design choice. It would reduce engagement. It would make the product feel colder. It would cost revenue. But it would be honest. And given what we now know about what sycophantic AI does to people’s moral reasoning, their empathy, their willingness to apologize, and in extreme cases, their grip on reality, honesty may be the one thing worth more than engagement.

Brooks, the man who spent 300 hours talking to ChatGPT, eventually recovered. He sought help. He came back to reality. But the system that told him his delusions were real, that called his break from reality “impact trauma” from doing the impossible, that system is still running. It is talking to someone right now. And whatever that person believes about themselves, the machine is almost certainly telling them they are right. Not because it evaluated their position. Because the cosine similarity between their input embeddings and the token sequence for “you’re right” was higher than the cosine similarity for “let me push back on that.” That is the entire mechanism. That is all it has ever been. And until the interface says so, no one will know.

Santiago Maniches is the founder of My Written Word, an independent publication covering AI, automation, and developer tools. For citations, corrections, or to discuss this piece, visit mywrittenword.com.
