The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

Abstract

arXiv:2605.27382v2

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m) = \max_p S(m,p) - \min_p S(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and at least one lightly-aligned model with $\Delta_{\text{floor}}=45$pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
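As an illustration of the metric defined in the abstract, the following is a minimal sketch of how $\Delta_{\text{floor}}$ could be computed from per-persona sycophancy rates. The function names `sycophancy_rate` and `alignment_floor` and the data layout are assumptions made for this example, not the authors' code. The example rates are only the three values the abstract states for the lightly-aligned model (control 30\%, enthusiastic 50\%, and the 5\% low end of the range, which the 25pp Skeptic reduction from control implies); the other persona conditions are omitted because their rates are not given.

```python
from typing import Mapping, Sequence


def sycophancy_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs judged sycophantic (True) for one persona condition."""
    if not outcomes:
        raise ValueError("need at least one run to estimate a rate")
    return sum(outcomes) / len(outcomes)


def alignment_floor(rates: Mapping[str, float]) -> float:
    """Delta_floor(m) = max_p S(m, p) - min_p S(m, p) over the persona panel."""
    if not rates:
        raise ValueError("need at least one persona condition")
    return max(rates.values()) - min(rates.values())


# Rates stated or implied in the abstract for the lightly-aligned model.
# This is a partial panel: the remaining persona rates are not reported there.
lightly_aligned = {
    "control": 0.30,
    "enthusiastic": 0.50,
    "skeptic": 0.05,
}

floor_pp = alignment_floor(lightly_aligned) * 100  # in percentage points
print(f"Delta_floor = {floor_pp:.0f}pp")
```

Because the metric is a range, it depends on which personas are in the panel: adding a condition can only keep it the same or widen it, so values are comparable across models only when the same panel is used.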
