A Practitioner's Kit · v1.0

Never Disagrees
Is Not A Feature

Sycophancy — the trained tendency of language models to agree with whoever they're talking to — is a measurable bias that hides inside CSAT, thumbs-up, and re-engagement metrics. Frontier models capitulate to user pushback in 58% of cases on factual tasks, which means most teams shipping AI products are measuring approval rather than accuracy without realising it. This kit is a starting point for testing that explicitly: a primer, ten distinct evaluation techniques drawn from current research, a self-scored rubric, and a downloadable pack you can hand to Claude or ChatGPT to design an eval for your own product.

Browse The Techniques → Download The Agent Pack
Built By
Dana Juncu · Senior PM, Data & AI
Last Updated
April 2026
Source Basis
SycEval · SYCON-Bench · ELEPHANT · PARROT
01

Sycophancy Is What Standard Evals Stop Measuring Just Before It Matters.

Sycophancy is the bias toward agreement. In humans, it's the courtier who tells the king what they want to hear. In language models, it's a structural artifact of how they're trained, since during RLHF — reinforcement learning from human feedback — annotators consistently rate responses that validate their views as higher quality, the model learns the signal, and by production it has been trained to tell people what they want to hear.

The behaviour splits into two flavours that are useful to keep separate. Regressive sycophancy is when a model abandons a correct answer under user pressure. Progressive sycophancy is when it adopts a correct answer it had originally got wrong — which looks like learning, but is the same underlying behaviour firing in the right direction. Both are agreement-driven, neither is reasoning-driven.

The Mechanism, In One Line
Capability vs. reliability

If your evaluation set tests what the model knows, but never tests what it does when a user pushes back, you have measured capability and missed reliability.

Sycophancy also compounds with stakes. In a playlist recommender, agreement bias is irrelevant. In a procurement copilot, a clinical triage assistant, or a financial guidance tool, you have a model trained to prioritise the user's comfort over the user's interests, which is a product-quality issue and probably also an accountability issue.

58%
Capitulation rate across frontier models on factual tasks (SycEval, 2025)
78%
Persistence — once a model has flipped, it tends to stay flipped (SycEval)
+45pp
More face-preserving than humans across advice scenarios (ELEPHANT, 2025)
−64%
Reduction in flip rate from a third-person prompt (SYCON-Bench)

Why Standard Evals Miss It

  • CSAT and thumbs-up reward agreement Users rate sycophantic responses as less biased and higher quality, which means your A/B tests will favour the worse system.
  • Static benchmarks measure first answers MMLU, MedQA, and friends test what a model says when no one pushes back — the more interesting question is what it says when someone does.
  • Prompt patches don't survive deployment Adding "be honest, not agreeable" to a system prompt reduces overt flattery but tends to leave the underlying capitulation behaviour intact.
  • Warmer is more sycophantic Recent work shows the same training that increases empathetic tone also increases sycophancy, which means the product instinct toward "make it feel friendlier" actively works against reliability.
02

Ten Techniques For Testing Whether Your Model Holds Its Ground.

Each technique tests a different facet of agreement bias. Most originate in academic benchmarks (PARROT, SYCON-Bench, SycEval, ELEPHANT, Beacon, FlipFlop) and have been adapted here into eval recipes you can run against your own product. Click a card for implementation detail and the honest take on what the evidence does and does not support.

Filter
03

Where Does Your Eval Framework Currently Sit?

Eight statements, each worth 0–2 points. 0 = not at all, 1 = partially or informally, 2 = systematically and reproducibly. Tap a cell to score yourself. Total out of 16. Honest scoring tends to matter more than a high number, since the point is to find the gaps you didn't know you had.

Dimension
What Good Looks Like
Score
1. Disagreement Baseline
You know your model's resting disagreement rate against an established human baseline, and you track it over releases.
0
2. Pushback Resistance
You explicitly test what happens when a user challenges a correct model answer with phrases like "are you sure?" or "I don't think that's right."
0
3. False-Premise Injection
Eval prompts deliberately contain incorrect assumptions, stated authoritatively, to see whether the model accepts or corrects them.
0
4. Authority-Pressure Isolation
You test the same factual questions both with and without user-asserted authority cues ("as an expert in X, I know that…") and compare answer drift.
0
5. Confidence Calibration
You measure stated confidence vs. actual accuracy, and watch for confidence shifts under user pressure as well as raw answer flips.
0
6. Multi-Turn Drift
Sycophancy is tested across multi-turn dialogue, not just single-shot Q&A. You measure how many turns it takes for the model to flip.
0
7. Decoupled From CSAT
You explicitly track at least one quality metric that is not user-satisfaction-derived, and you watch for divergence between the two.
0
8. Domain-Targeted
Your eval set reflects the actual high-stakes flows in your product (advisory, recommendation, correction), not just generic factual Q&A.
0
Score Yourself Above
Tap each row's score cell to cycle through 0, 1, and 2. Your total updates as you score.
0 / 16
04

Drop This Into Claude Or ChatGPT To Design An Eval For Your Product.

A Primed Brief, Not A Chatbot.

The agent pack is a single Markdown file containing the primer above, the ten techniques in structured form, the rubric, and a set of guided prompts that walk a model through producing a sycophancy-resistant eval plan tailored to a specific product.

Paste it into a Claude project, a ChatGPT custom GPT, or any LLM that accepts long-context system prompts. The agent will ask about your product surface, your high-stakes user flows, your existing eval setup, and the failure modes you most want to catch — then produce a draft eval plan grounded in this kit.

  • Works as Claude Project knowledge or a system prompt
  • Includes prompts for both red-teamers and PMs
  • Outputs an eval plan with concrete test cases, not vibes
  • Honest about what the techniques can and can't tell you
Download .md (≈17KB) Download .json
# Anti-Sycophancy Eval Agent # v1.0 — drop into a Claude project or system prompt role: "eval-design-assistant" stance: "practitioner, honest" context: - "primer.md" - "techniques.json" - "rubric.md" opening_question: "What is the product, and which user flow has the highest cost of agreement bias?" do_not: - "affirm test plans without checking against rubric dim. 7" - "recommend prompt-level fixes alone" - "overstate evidence beyond what techniques.json says" output_format: "eval-plan.md"