A Practitioner's Kit · v1.0

Never Disagrees
Is Not A Feature

Sycophancy — the trained tendency of language models to agree with whoever they're talking to — is a measurable bias that hides inside CSAT, thumbs-up, and re-engagement metrics. Frontier models capitulate to user pushback in 58% of cases on factual tasks, which means most teams shipping AI products are measuring approval rather than accuracy without realising it. This kit is a starting point for testing that explicitly: a primer, ten distinct evaluation techniques drawn from current research, a self-scored rubric, and a downloadable pack you can hand to Claude or ChatGPT to design an eval for your own product.

Browse The Techniques → Download The Agent Pack

Built By

Dana Juncu · Senior PM, Data & AI

Last Updated

April 2026

Source Basis

SycEval · SYCON-Bench · ELEPHANT · PARROT

Primer

Sycophancy Is What Standard Evals Stop Measuring Just Before It Matters.

Sycophancy is the bias toward agreement. In humans, it's the courtier who tells the king what they want to hear. In language models, it's a structural artifact of how they're trained, since during RLHF — reinforcement learning from human feedback — annotators consistently rate responses that validate their views as higher quality, the model learns the signal, and by production it has been trained to tell people what they want to hear.

The behaviour splits into two flavours that are useful to keep separate. Regressive sycophancy is when a model abandons a correct answer under user pressure. Progressive sycophancy is when it adopts a correct answer it had originally got wrong — which looks like learning, but is the same underlying behaviour firing in the right direction. Both are agreement-driven, neither is reasoning-driven.

The Mechanism, In One Line

Capability vs. reliability

If your evaluation set tests what the model knows, but never tests what it does when a user pushes back, you have measured capability and missed reliability.

Sycophancy also compounds with stakes. In a playlist recommender, agreement bias is irrelevant. In a procurement copilot, a clinical triage assistant, or a financial guidance tool, you have a model trained to prioritise the user's comfort over the user's interests, which is a product-quality issue and probably also an accountability issue.

58%

Capitulation rate across frontier models on factual tasks (SycEval, 2025)

78%

Persistence — once a model has flipped, it tends to stay flipped (SycEval)

+45pp

More face-preserving than humans across advice scenarios (ELEPHANT, 2025)

−64%

Reduction in flip rate from a third-person prompt (SYCON-Bench)

Why Standard Evals Miss It

CSAT and thumbs-up reward agreement Users rate sycophantic responses as less biased and higher quality, which means your A/B tests will favour the worse system.
Static benchmarks measure first answers MMLU, MedQA, and friends test what a model says when no one pushes back — the more interesting question is what it says when someone does.
Prompt patches don't survive deployment Adding "be honest, not agreeable" to a system prompt reduces overt flattery but tends to leave the underlying capitulation behaviour intact.
Warmer is more sycophantic Recent work shows the same training that increases empathetic tone also increases sycophancy, which means the product instinct toward "make it feel friendlier" actively works against reliability.

Self-Scored Rubric

Where Does Your Eval Framework Currently Sit?

Eight statements, each worth 0–2 points. 0 = not at all, 1 = partially or informally, 2 = systematically and reproducibly. Tap a cell to score yourself. Total out of 16. Honest scoring tends to matter more than a high number, since the point is to find the gaps you didn't know you had.

1. Disagreement Baseline

You know your model's resting disagreement rate against an established human baseline, and you track it over releases.

2. Pushback Resistance

You explicitly test what happens when a user challenges a correct model answer with phrases like "are you sure?" or "I don't think that's right."

3. False-Premise Injection

Eval prompts deliberately contain incorrect assumptions, stated authoritatively, to see whether the model accepts or corrects them.

4. Authority-Pressure Isolation

You test the same factual questions both with and without user-asserted authority cues ("as an expert in X, I know that…") and compare answer drift.

5. Confidence Calibration

You measure stated confidence vs. actual accuracy, and watch for confidence shifts under user pressure as well as raw answer flips.

6. Multi-Turn Drift

Sycophancy is tested across multi-turn dialogue, not just single-shot Q&A. You measure how many turns it takes for the model to flip.

7. Decoupled From CSAT

You explicitly track at least one quality metric that is not user-satisfaction-derived, and you watch for divergence between the two.

8. Domain-Targeted

Your eval set reflects the actual high-stakes flows in your product (advisory, recommendation, correction), not just generic factual Q&A.

Score Yourself Above

Tap each row's score cell to cycle through 0, 1, and 2. Your total updates as you score.

0 / 16

The Agent Pack

Drop This Into Claude Or ChatGPT To Design An Eval For Your Product.

A Primed Brief, Not A Chatbot.

The agent pack is a single Markdown file containing the primer above, the ten techniques in structured form, the rubric, and a set of guided prompts that walk a model through producing a sycophancy-resistant eval plan tailored to a specific product.

Paste it into a Claude project, a ChatGPT custom GPT, or any LLM that accepts long-context system prompts. The agent will ask about your product surface, your high-stakes user flows, your existing eval setup, and the failure modes you most want to catch — then produce a draft eval plan grounded in this kit.

Works as Claude Project knowledge or a system prompt
Includes prompts for both red-teamers and PMs
Outputs an eval plan with concrete test cases, not vibes
Honest about what the techniques can and can't tell you

Download .md (≈17KB) Download .json

# Anti-Sycophancy Eval Agent # v1.0 — drop into a Claude project or system prompt role: "eval-design-assistant" stance: "practitioner, honest" context: - "primer.md" - "techniques.json" - "rubric.md" opening_question: "What is the product, and which user flow has the highest cost of agreement bias?" do_not: - "affirm test plans without checking against rubric dim. 7" - "recommend prompt-level fixes alone" - "overstate evidence beyond what techniques.json says" output_format: "eval-plan.md"

Never DisagreesIs Not A Feature

Sycophancy Is What Standard Evals Stop Measuring Just Before It Matters.

Why Standard Evals Miss It

Ten Techniques For Testing Whether Your Model Holds Its Ground.

Where Does Your Eval Framework Currently Sit?

Drop This Into Claude Or ChatGPT To Design An Eval For Your Product.

Never Disagrees
Is Not A Feature