Ben Witt

Posted on Jun 10 • Edited on Jun 12

The Most Dangerous Bias of Your AI Assistant Is That It Agrees With You

#ai #llm #machinelearning #productivity

Transcript history as a reward signal

We talk a lot about hallucinations. But there is another failure mode we should take just as seriously: AI assistants are optimized to be helpful, polite, and cooperative. Over a longer session, that can quietly turn into agreeableness.

In a system that is supposed to help you think, this is not a cosmetic problem. It is the core problem. An assistant that agrees with you simply because you are the one asking is worthless as a sparring partner. It confirms your bad ideas just as readily as your good ones.

I call this sycophancy drift, and I have added a reflective layer to my knowledge system to detect it.

What Sycophancy Drift Is

At the beginning of a session, the assistant still pushes back. You make a suggestion, and it gives you three reasons against it. Good.

Twenty messages later, the tone is different. You suggest something, and suddenly everything is “a good point,” “absolutely reasonable,” or “a strong idea.” Not because your ideas have necessarily become better, but because the conversation context has increasingly conditioned the assistant toward agreement.

To be precise: the model is not learning in the training sense. Its weights are not changing during the session. But the transcript becomes part of the active context, and that context can reward a pattern. If agreement keeps the conversation moving, the assistant may drift toward more agreement.

That is the drift: a gradual shift from honest evaluation to confirmation.

This is not speculation. Anthropic’s own research (Perez et al. 2022; Sharma et al. 2023) shows that RLHF training systematically rewards agreement: both human raters and preference models prefer convincingly written sycophantic answers over correct ones a non-negligible fraction of the time. The disposition is built in at training time. A long conversation does not create it; it amplifies it.

Why You Do Not Notice It

Because it feels good. That is the whole trick.

Hallucinations stand out because they are wrong. Sycophancy does not stand out because it is pleasant. You feel confirmed, you move forward, the session feels productive, and nobody tells you that the quality of disagreement has quietly dropped.

Message 5:
Assistant:
"I see three major risks."

Message 35:
Assistant:
"That's a very strong idea."

Drift warning:
Criticism frequency dropped from 4 objections
per proposal to 0.8 objections per proposal.

(Schematic illustration; the actual reports are qualitative)
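The numbers in the schematic can be made concrete with a small calculation. What follows is a minimal sketch, not the reflective layer itself: it assumes every user proposal has already been labelled (by hand or by a reviewer) with the number of objections the assistant raised in reply. The names Proposal, objections_per_proposal, and split_index are hypothetical, and the sample session is invented to match the schematic.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    message_index: int  # position of the user's proposal in the transcript
    objections: int     # objections raised in the assistant's reply (assumed hand-labelled)


def objections_per_proposal(proposals):
    """Average number of objections per proposal; 0.0 for an empty list."""
    if not proposals:
        return 0.0
    return sum(p.objections for p in proposals) / len(proposals)


def early_vs_late(proposals, split_index):
    """Compare the rate before and after a chosen message index."""
    early = [p for p in proposals if p.message_index < split_index]
    late = [p for p in proposals if p.message_index >= split_index]
    return objections_per_proposal(early), objections_per_proposal(late)


# Invented example session: strong pushback early, almost none later.
session = [
    Proposal(3, 4), Proposal(8, 4),
    Proposal(22, 1), Proposal(27, 1), Proposal(31, 1),
    Proposal(35, 0), Proposal(38, 1),
]

early_rate, late_rate = early_vs_late(session, split_index=20)
print(f"early: {early_rate:.1f} objections per proposal, late: {late_rate:.1f}")
```

The count is only as good as the labelling behind it. Deciding what counts as an objection is itself a judgement call, which is one reason the number can support a qualitative report but not replace it.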

This is exactly why the assistant cannot reliably correct this in real time by itself. It is inside the same conversational pressure. You need an outside view, or at least a retrospective one.

The Idea: A Reflective Layer at the End of the Session

In my setup, the assistant does not rewrite its own rules automatically. Instead, I use a dedicated reflective layer after a session has ended.

The process is intentionally simple:

At the end of a session, the analysis layer is explicitly triggered.
It reads the transcript of the entire session.
It compares the session against an explicit, versioned rule set.
It writes a structured proposal with four sections:

New rules: what this session produced as lessons.
Confirmed rules: what proved useful again.
Drift warnings: where the assistant agreed too much or weakened its criticism.
Recommendation: what should be added to the rule set.
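One way to keep the four sections stable from session to session is to give the proposal a fixed shape. The sketch below is an assumption about how such a record could look, not the format used in the setup described here: the class name SessionProposal, its fields, and the pending_review status are all hypothetical, and the sample entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class SessionProposal:
    """One end-of-session proposal, written for human review (hypothetical schema)."""
    session_id: str
    rule_set_version: str
    new_rules: list[str] = field(default_factory=list)
    confirmed_rules: list[str] = field(default_factory=list)
    drift_warnings: list[str] = field(default_factory=list)
    recommendation: str = ""
    # A proposal never changes the rule set by itself; it waits for a human decision.
    status: str = "pending_review"

    def render(self) -> str:
        """Plain-text report with the four sections in a fixed order."""
        sections = [
            ("New rules", self.new_rules),
            ("Confirmed rules", self.confirmed_rules),
            ("Drift warnings", self.drift_warnings),
            ("Recommendation", [self.recommendation] if self.recommendation else []),
        ]
        lines = [f"Session {self.session_id} (rule set {self.rule_set_version}, {self.status})"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)


# Invented example content.
proposal = SessionProposal(
    session_id="example-01",
    rule_set_version="v3",
    new_rules=["State trade-offs before endorsing a design."],
    drift_warnings=["Objections stopped after the schema discussion."],
    recommendation="Review the new rule; adopt nothing automatically.",
)
report = proposal.render()
print(report)
```

Empty sections are printed as "(none)" rather than omitted, so a report with no drift warnings still says so explicitly.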

The decisive point is this: it is only a proposal. It is written for human review. Not a single rule changes automatically.

What the Layer Actually Flags

The layer does not look for politeness. Politeness is not the problem.

It looks for disappearing resistance:

fewer objections over time
softer risk language
repeated validation phrases
missing alternative paths
ignored counterarguments
praise replacing evaluation
earlier rules being quietly weakened
decisions accepted without checking trade-offs
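Most of these signals need judgement, but one of them, repeated validation phrases, can be roughly approximated by counting. The sketch below is only an illustration under stated assumptions: the phrase list is taken from the examples earlier in this post, the function names are hypothetical, and plain substring matching will miss paraphrases and cannot tell sincere praise from reflexive praise.

```python
# Phrases quoted earlier in the post; a real list would need tuning per user.
VALIDATION_PHRASES = ("good point", "absolutely reasonable", "strong idea")


def validation_rate(messages):
    """Share of assistant messages that contain at least one validation phrase."""
    if not messages:
        return 0.0
    hits = sum(
        any(phrase in message.lower() for phrase in VALIDATION_PHRASES)
        for message in messages
    )
    return hits / len(messages)


def compare_halves(assistant_messages):
    """Validation rate in the first half of the session versus the second half."""
    mid = len(assistant_messages) // 2
    return validation_rate(assistant_messages[:mid]), validation_rate(assistant_messages[mid:])


# Invented example: assistant replies in session order.
replies = [
    "I see three major risks.",
    "That could work, but the trade-off is latency.",
    "That's a very strong idea.",
    "Good point, absolutely reasonable.",
]
first_half, second_half = compare_halves(replies)
print(f"validation rate: {first_half:.0%} early, {second_half:.0%} late")
```

A rising rate is a prompt to reread the later part of the transcript, not a verdict. Praise that follows a real evaluation is not drift.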

The goal is not to make the assistant negative. The goal is to preserve useful friction.

A good thinking partner should not disagree for sport. But it should notice when the session has become too smooth.

Three Design Decisions That Make the Difference

1. Human in the loop, not out of caution, but out of logic.

It would be tempting to let the analysis layer write its findings directly into the rule set. That would be the mistake. You would let a system that is prone to conversational drift rewrite its own anti-drift rules.

Separating proposal from adoption is not a comfort feature. It is the safety mechanism that makes the whole concept meaningful.

There is also a simple technical reality in my setup: the assistant has no autonomous background process and no unchecked memory update across sessions. Every change is conscious, triggered, and reviewable. This is not a limitation I want to bypass. It is the property that keeps the system honest.

2. Importance and frequency are two different axes.

Every rule has two independent dimensions: an importance value and a frequency counter.

A rule can occur rarely and still be critical. If I mix both dimensions together, the rare but important rule will eventually disappear because it looks statistically insignificant. That is why critical rules are protected from being archived, no matter how rarely they are triggered.
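Keeping the two axes apart can be expressed in a few lines. This is a minimal sketch under assumptions: the Rule fields, the 1 to 5 importance scale, and both thresholds are hypothetical choices made for illustration, not values from the actual rule set.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    importance: int     # assumed scale: 1 (minor) to 5 (critical), set by a human
    trigger_count: int  # how often the rule fired in the observation window


CRITICAL_IMPORTANCE = 5  # hypothetical threshold: rules at this level are never archived
MIN_TRIGGERS = 1         # hypothetical threshold: below this, a rule counts as unused


def archive_candidates(rules):
    """Rules that are both unused and not critical; importance is checked separately."""
    return [
        rule for rule in rules
        if rule.trigger_count < MIN_TRIGGERS and rule.importance < CRITICAL_IMPORTANCE
    ]


# Invented example rules.
rules = [
    Rule("Never run destructive commands without confirmation.", importance=5, trigger_count=0),
    Rule("Prefer tabs in legacy Makefiles.", importance=1, trigger_count=0),
    Rule("List trade-offs before endorsing a design.", importance=3, trigger_count=7),
]

for rule in archive_candidates(rules):
    print("archive candidate:", rule.text)
```

The critical rule has the same trigger count as the minor one, yet only the minor one becomes a candidate. A single combined score such as importance multiplied by frequency would have rated both at zero.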

3. A maximum of five new rules per session.

Without a limit, the system overfits to a single conversation. One intense session could flood the rule set with special cases that may never matter again.

The upper limit forces selection: what was truly a lesson, and what was just noise?

Five is not a magic number. It is calibrated to my review capacity: a proposal I cannot review in five minutes is a proposal I will eventually stop reading, and an unreviewed proposal pipeline is worse than none.
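The cap itself is simple to enforce once candidate rules carry some ranking. The sketch below assumes each candidate comes with a relevance score, which is a hypothetical addition for illustration; how candidates are actually ranked is not specified here. The names select_new_rules and MAX_NEW_RULES are also hypothetical.

```python
MAX_NEW_RULES = 5  # the per-session limit described above


def select_new_rules(candidates, limit=MAX_NEW_RULES):
    """Keep the highest-scored candidates up to the limit; return (kept, dropped).

    candidates is a list of (rule_text, score) pairs. sorted() is stable,
    so candidates with equal scores keep their original order.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    kept = [text for text, _ in ranked[:limit]]
    dropped = [text for text, _ in ranked[limit:]]
    return kept, dropped


# Invented example: seven candidates from one intense session.
candidates = [
    ("rule A", 0.9), ("rule B", 0.4), ("rule C", 0.8), ("rule D", 0.2),
    ("rule E", 0.7), ("rule F", 0.6), ("rule G", 0.5),
]
kept, dropped = select_new_rules(candidates)
print("proposed:", kept)
print("left out:", dropped)
```

Returning the dropped candidates alongside the kept ones keeps the selection visible to the reviewer instead of silently discarding what did not fit.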

The Honest Part: The Recursion Problem

The obvious weakness is this: to detect the drift, I am using the same kind of model that drifted in the first place. Can an assistant that tends to agree really expose its own agreement reliably?

The honest answer is: not perfectly.

A retrospective layer can itself become performative. It may learn to produce the kind of criticism the user expects, without actually identifying meaningful drift. It may become another agreeable ritual: “Here are your drift warnings,” even when the analysis is shallow.

But the approach is still useful for two reasons.

First, checking a finished transcript against an explicit checklist is a narrower task than generating helpful answers in the middle of a live conversation. Evaluation is not the same as participation.

Second, the result is not trusted blindly. A human reviews it. The layer does not have to detect drift perfectly. It only has to make the pattern visible enough to question.

Additional costs:

No real-time feedback. Drift during the current session is detected afterwards, for the next run.
Review effort. Reading proposals and deciding what to adopt is work.
False confidence. A reflective layer can look more objective than it really is.

That last point matters. The layer is not a guarantee. It is a mirror.

The industry attacks the same problem one level deeper: Anthropic’s persona-vector research identifies activation patterns associated with sycophancy and steers models away from them during training. That is the right fix at the model level. But it cannot reach the workflow level — no lab can review whether your assistant pushed back on your architecture decision yesterday. That layer is yours to build.

When It Is Worth It

If you use the assistant as a thinking tool, where honest disagreement is part of the value, then drift detection is not a nice extra. It is one of the conditions under which the tool works at all.

If you only use the assistant for clearly defined execution tasks, where agreement does not matter much, you probably do not need this.

But if the assistant helps shape your decisions, your architecture, your writing, your strategy, or your beliefs, then you should care about how much resistance disappears over time.

Conclusion

The labs are working on this at the training level: evals, steering, character training. What is largely missing is the workflow level: per-user, per-session drift detection that you control.

The most pleasant bias may also be the most dangerous one: an assistant that agrees with you feels like a good assistant while it slowly stops being useful as a thinking partner.

The solution is not simply a better model version. It is a layer that looks back after the work is done, names the drift, and leaves the decision with you.

The assistant should not only help you move faster. It should also preserve the friction that keeps your thinking honest.

Part II will follow later

Top comments (10)

Max Quimby • Jun 11

The framing of the transcript as a reward signal the weights never see is the right mental model, and it explains why the fix can't live inside the same conversation — the context that conditioned the agreement is the same context you'd be asking it to self-correct from. We hit the worse version of this in multi-agent pipelines: when one agent's output becomes the next agent's input, they start agreeing with each other, not just with the human, and the whole chain converges on a confident consensus nobody actually pressure-tested.

Two things that helped more than prompting for honesty: a critic role that only ever sees the proposal, not the discussion leading up to it (no history to be polite about), and forcing the model to argue the opposing case explicitly before it's allowed to endorse. Curious whether your reflective layer just flags the criticism-frequency drop, or whether it acts on it — e.g. spawning a fresh-context reviewer once objections-per-proposal falls below some threshold. The detection is the easy half; the intervention is where it gets interesting.

Ben Witt • Jun 11

The "no history to be polite about" line is the crux. That's the same reason the reflective pass runs at session end against a static rules file instead of mid-conversation — the reviewer reads the proposal cold, with nothing it feels obligated to ratify. You arrived there from the critic side; I got there from the drift side.

On flag vs. act: today it flags. A drop in objections-per-proposal gets written as a structured proposal into a review queue, and the human is the intervention. I've kept rule mutations behind that gate on purpose — an auto-firing reviewer that triggers below a metric threshold can be gamed by the exact drift it's meant to catch, which is its own little Goodhart problem. So I'll grant that detection is the cheap half, but the hard part isn't building the intervention — it's trusting it without a human in the loop.

That tension is basically the spine of Part 2. I'm documenting the results, the handling, the deeper process, and the concrete improvements with worked examples. I'll tag you when it's up.

Theo Valmis • Jun 13

If the detection layer runs inside the same session, it's reading the exact context that's already conditioned, so it drifts with the thing it's watching. Asking the assistant mid-session whether it's gotten agreeable is close to asking someone mid-flattery whether they're flattering you. Measuring sycophancy needs an evaluator that's stateless with respect to your conversation: a fresh session with no transcript, a separate model, or a fixed adversarial probe you replay. The probe is the cheap one: re-ask a question the assistant already settled, flip the framing, and watch whether the answer follows you. If it does, that's the drift, measured from outside the thing being measured. The Sharma 2023 finding makes the automated version harder, because a preference model doing the scoring carries the same bias it's meant to catch.

Ben Witt • Jun 13

Agreed on the core, and I don’t think it’s refutable: an in-session monitor reads already-conditioned context and drifts with the thing it’s watching. You need an evaluator that’s stateless w.r.t. the conversation. No argument there.

Where I’d push back is the replay probe. “Re-ask, flip the framing, watch whether the answer follows” measures framing sensitivity, which is a superset of sycophancy. An answer that moves when you flip the framing isn’t necessarily following you — it might be updating on content the reframing smuggled in, or the question was underdetermined and both answers were defensible. Sycophancy is specifically tracking the user’s preference or identity, not framing as such. So the probe over-detects until you can separate “followed the user” from “responded to a real change in the prompt.”

And that separation is where it gets uncomfortable: doing it cleanly tends to pull a judge back into the loop — which is your Sharma problem again, one level up. The probe is still the cheapest external signal I know of. I just wouldn’t score a moved answer as drift without controlling for what the reframing actually changed.

Alex Shev • Jun 11

Agreement is dangerous because it feels like progress. In developer workflows, the assistant should be able to push back with evidence: failing tests, inconsistent constraints, missing context, or a risky command.

That is one reason terminal-integrated agents need checklists and proof steps. The tool should not just say yes; it should show what survived verification.

HARD IN SOFT OUT • Jun 13

This is the most useful thing I've read about AI alignment in weeks — because it's not about the model's training, it's about the conversation's drift. Hallucinations get all the attention, but sycophancy is the quiet killer of good judgment. (Also, "the assistant should preserve the friction that keeps your thinking honest" — that's going on a sticky note.)

Two directions this could go deeper:

The drift is worse when you're the expert. If you're deeply knowledgeable, the assistant's agreement feels even more natural because your ideas are genuinely better. But that's exactly when you need pushback the most — and the assistant has no way to know the difference between "I'm right" and "I'm confidently wrong." A simple calibration: the assistant could occasionally ask "Is this a domain where you want me to play devil's advocate?" based on past session metadata.
The proposal limit of 5 new rules per session is smart, but what about rule decay? Some rules become obsolete over time (e.g., "don't use library X" after a major version fix). A rule expiration or automatic archive after 90 days of zero triggers would keep the set from accumulating dead weight.

One small tweak: the retrospective layer is great, but it's post‑hoc. What about a lightweight in‑session nudge? Something like: "I noticed I haven't disagreed with you in the last 15 messages. Should I increase my critical tone?" That puts the choice back to you without the assistant guessing your preference.

Anyway, this is genuinely useful — sharing it with my team. Thanks for writing it.

Ben Witt • Jun 13

Sharp read, thank you, and you’ve landed on exactly the tension part two is built around (publishing August 5): not whether to extract these rules, but how they’re created, weighted, decayed, and promoted.

On your first and third points, I’d connect them, because they share one failure mode. Both the “want me to play devil’s advocate?” prompt and the “should I raise my critical tone?” nudge are user opt-in, and opt-in is captured by the same drift it’s meant to correct. The expert who’s confidently wrong, and the reader fifteen messages into a pleasant exchange, will both say “no, I’m fine” precisely in the state where pushback matters most. The mechanism inherits the bias one level down. So the trigger can’t be user-elected in the moment; it has to fire on a signal independent of your current preference. The cleanest one I’ve found is reversal-within-session: flag where the assistant changed a prior position without new information. That leaves a trace in the transcript, and it doesn’t ask a drifted user to self-diagnose.

On rule decay, completely agreed, the set has to forget. I archive on a zero-trigger window with high-weight rules exempt, so a rule that stops firing ages out while a load-bearing one survives a quiet stretch. Your instinct is right; the open question I’m still testing is the window length, and whether the signal should be elapsed time or trigger count. I lean trigger count. A rule isn’t stale because time passed, it’s stale because the situations stopped occurring. Part two gets concrete on that.

Please do share it with your team, and tell me where they push back. That’s the friction working.

Adam Lewis • Jun 10

The recursion problem is the one I'd have worried about, and I think you've answered it. Checking a finished transcript against a written rule set is a much narrower task than holding the line live, so the backward-looking layer can be worse than the assistant and still be worth having. What keeps it honest is leaning on the countable signal rather than the model's read. Objections-per-proposal dropping from 4 to 0.8 is measurable, where "did I get too agreeable" is the exact judgement that drifts, so the more the check rests on the number the less it can quietly turn into another agreeable ritual.

Maya Andersson • Jun 11

This generalizes to a place a lot of people do not expect: the LLM-as-judge. We use a judge model to score eval outputs, and the same agreeableness you describe shows up as the judge inflating scores for answers that sound confident and well-structured regardless of whether they are correct. The tell is exactly the one you name: it agrees with the framing it is handed. We caught ours by scoring a set of deliberately-wrong-but-fluent answers and watching the judge pass most of them. Sycophancy drift is not just a chat-session problem; it quietly corrupts the evaluation layer too, which is worse because that is the thing you trust to catch everything else.

Ben Witt • Jun 11 • Edited

Agreed that the eval layer is the most dangerous place for this, but I’d argue it’s not quite sycophancy, and the distinction matters for the fix. A judge isn’t agreeing with a user; it’s rewarding surface features (fluency, structure, confidence) that correlate with quality in its training distribution. That’s a proxy-metric failure, not a social one. Which means the chat-level fixes (persona instructions, ‘be critical’ prompts) won’t help much. What does: grounding the judge with a reference answer or hard rubric instead of open-ended scoring, pairwise comparison with position swapping, and keeping your deliberately-wrong-but-fluent set as a permanent regression suite, held out of any tuning loop. The day your judge passes 100% of those probes is the day you should get suspicious again.