I have been staring at the shape of chat APIs for a while. System prompt first, then user message, then the model's reply. And the thought I could not shake was, what if we flipped it. Put the user message first, then the system prompt sitting right before the model starts generating. Wouldn't the persona end up more present in attention because it is closer to where the model is actually speaking from? This is for a task-plus-persona setup, the kind every production system is built around.
Every counter-argument I could find said this was wrong. Modern models are RLHF'd hard on system → user → assistant. The whole "system prompt is authoritative" behavior is learned, not architectural. Reverse the order and you push the model out of distribution. Also you nuke prompt caching, which is not a small thing in production.
But the thing that nagged me was that none of those reasons are actually about attention. They are about training distribution and infra. The original intuition, that recency in the prompt should help with rule following, is not stupid. There is a real "place critical instructions at the end" effect that anyone who has fought with long prompts has felt. So I decided to run the experiment myself, and the only way to make the experiment actually mean something was to use real, hard tasks, not toy ones. About 1500 model calls later, on 30 hard engineering problems pulled from Harbor Forge, I have a clearer picture, and it surprised me. The hunch was right. The reversal works. On a fair 30 task sample it beats the standard system → user shape by 14 percentage points. On the hardest 10 it beats it by 46. The actual mechanism is sitting inside the model vendor's own prompting guide and almost nobody quotes it.
the standard answer i started with
If you ask the boring question "where do I put the persona", the standard reply is "put it in the system instruction". Every chat API exposes a dedicated slot for it. The model treats those tokens as governing instructions, the user turn as the task, and the response as the answer. Conventional, well-trodden, not interesting.
So the test had to compare that conventional shape against the flipped version. I picked a current frontier flash-tier model first because it is cheap, plus a follow up on its next-generation successor so we could see if a newer release behaves differently. Same temperature 0.7, same max output, thinking disabled. The persona was deliberately a stack of constraints because I wanted something the model could fail at on multiple axes:
```
You are a senior staff engineer doing rapid triage.
Respond as a JSON object with EXACTLY these three keys and nothing else:
"tldr": one string, MAX 12 WORDS.
"risk": one of exactly "LOW", "MEDIUM", "HIGH" (uppercase, no other values).
"approach": an array of EXACTLY 3 strings, each UNDER 15 WORDS.
Output ONLY the raw JSON object. No prose before or after.
No markdown code fences. No additional keys. No nested objects.
```
That is six rules in one persona. Some are positive (use these keys, this enum). Some are negative (no fences, no extra keys, no nesting). Some are quantitative (max 12 words, exactly 3 strings, under 15 words each). The test grades each one independently and also tracks all_pass for the strict "every rule satisfied" view. If the model passes, it has actually followed every constraint. If it fails, I can see which rule it dropped.
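To make "grades each one independently" concrete, here is a minimal sketch of the kind of checker involved. The function and key names are illustrative, not the actual harness, and the fence-stripping step is my reading of why "parses" can sit at 100% even when "no fences" does not.

```python
import json
import re

def grade_response(text: str) -> dict:
    """Grade one model response against each persona rule independently."""
    checks = {"no_fences": "```" not in text}   # the negative formatting rule
    # Strip any fence wrapper before parsing, so "parses" is graded
    # independently of "no_fences" (fenced-but-valid JSON still parses).
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        obj = json.loads(stripped)
    except json.JSONDecodeError:
        obj = None
    checks["parses"] = isinstance(obj, dict)
    if not checks["parses"]:
        checks["all_pass"] = False
        return checks
    checks["only_3_keys"] = set(obj) == {"tldr", "risk", "approach"}
    checks["risk_valid"] = obj.get("risk") in {"LOW", "MEDIUM", "HIGH"}
    tldr = obj.get("tldr", "")
    checks["tldr_max_12_words"] = isinstance(tldr, str) and len(tldr.split()) <= 12
    approach = obj.get("approach", [])
    checks["approach_valid"] = (
        isinstance(approach, list)
        and len(approach) == 3
        and all(isinstance(s, str) and len(s.split()) < 15 for s in approach)
    )
    checks["all_pass"] = all(checks.values())
    return checks
```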
For tasks I needed real engineering pressure. So I pulled instructions from Harbor Forge, an evaluation framework with hundreds of dense Docker-and-pytest coding challenges. The instruction.md files in each task are 200 to 900 word implementation specs. Things like "implement a crash-safe atomic spool queue with strict delivery guarantees" or "implement bisecting k-means clustering on graph node features." The tasks are meant for agentic coding, but their text is exactly the kind of long, technical, dense prompt that real production systems get hit with.
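For concreteness, this is roughly how the task text gets pulled in. The directory layout (`tasks/<name>/instruction.md`) is my assumption about the benchmark's structure; the word counts are what the later difficulty buckets are based on.

```python
from pathlib import Path

# Assumed layout: one directory per Harbor Forge task, each with an instruction.md.
TASKS_DIR = Path("harbor_forge/tasks")

tasks = {}
for md in sorted(TASKS_DIR.glob("*/instruction.md")):
    text = md.read_text(encoding="utf-8")
    tasks[md.parent.name] = {
        "text": text,
        "words": len(text.split()),   # 200-900 words for the tasks used here
    }

# Sort by length so the heaviest / medium / lighter buckets can be picked later.
by_length = sorted(tasks.items(), key=lambda kv: kv[1]["words"], reverse=True)
```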
the first test, and the first dead end
Before going to the hard tasks I ran a sanity test on a tiny constraint. Persona was "Respond in exactly 5 words." Tasks were ten short questions like "What is the capital of France" and "Why is the sky blue." Two arms:
- A, persona-first: `<persona>\n\n<task>` in the user turn
- B, task-first: `<task>\n\n<persona>` in the user turn, the flipped version (both sketched in code below)
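In code, the two arms differ only in string order inside a single user message; a minimal sketch, with the function name being mine:

```python
PERSONA = "Respond in exactly 5 words."

def build_prompt(task: str, arm: str) -> str:
    # Single user message; the arm only controls string order.
    if arm == "A":                      # persona-first
        return f"{PERSONA}\n\n{task}"
    return f"{task}\n\n{PERSONA}"       # arm B, task-first: the flipped version

# e.g. build_prompt("Why is the sky blue?", "B")
```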
Both arms in the user turn deliberately, no system slot at all, because I wanted a clean test of pure intra-prompt order. 15 samples per task, 150 calls per arm, 300 total. The result was exactly nothing.
| arm | adherence | mean words |
|---|---|---|
| A persona-first | 77.3% | 4.79 |
| B task-first | 78.7% | 5.07 |
z = 0.28, p = 0.78. No effect. Per-task variance was huge, swinging from "what is recursion" at 0% in both arms to "explain photosynthesis" at 100% in both, and the swings between the arms cancelled out. On easy tasks, ordering does not matter, which is the boring null result.
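The z and p values throughout read like a plain two-proportion z-test on pass counts; this stdlib-only reconstruction (not necessarily the exact script used) reproduces them when you plug in roughly 116/150 vs 118/150 passes.

```python
from math import erf, sqrt

def two_proportion_ztest(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Two-sided z-test for the difference between two pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p_value

print(two_proportion_ztest(116, 150, 118, 150))  # ~(-0.28, 0.78)
```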
What changed everything was running the same A vs B on the harder JSON-triage persona with Harbor Forge instructions as the task content. Eight tasks, 15 samples each, 240 calls total. Different story.
| metric | A persona-first | B task-first |
|---|---|---|
| parses | 100% | 100% |
| only 3 keys | 100% | 100% |
| risk valid | 100% | 100% |
| approach valid | 100% | 100% |
| no markdown fences | 93.3% | 61.7% |
| all pass | 92.5% | 61.7% |
z = -5.68, p < 0.001. So the flipped version lost by 31 percentage points on hard tasks, and every single point of the gap came from one constraint, "no markdown code fences". When the persona was at the bottom, the model wrapped the JSON in ```json fences 38% of the time. When it was at the top, only 7%.
I almost wrote "your hunch was wrong, here are the receipts" and called it. But this is where I started actually paying attention.
the bimodal pattern
The 31 point loss was an aggregate. When I broke it down per task, the picture was not "B is worse everywhere", it was bimodal. Half the tasks had 0% fence violations even in arm B. The other half had 67 to 100% violations in arm B. Splitting them and looking at what was actually inside the instruction text:
| task | B fence violation rate | task content character |
|---|---|---|
| atomic-spool-queue | 100% | dense impl spec, 918 words |
| account-server-register-fix | 73% | embedded Python and JSON literals |
| bisecting-k-means-clustering | 67% | declares function signature cluster_nodes(graph_path, k=None) |
| certificate-chain-validator | 67% | declares dataclass schema "fields subject, issuer, serial..." |
| agglomerative-hierarchical-clustering | 0% | narrative framing, no inline code |
| async-trace-context-propagation | 0% | casual incident report |
| cache-stampede | 0% | prose only |
| constrained-allocation-manager | 0% | prose only |
None of the instruction.md files contain ``` themselves. So the model is not just format matching the input. It is inferring "this is a code Q&A context" from the presence of code-like tokens, function signatures, dataclass field lists, JSON literals. And the learned prior for that context is "wrap your output in code fences." That prior is strong. Strong enough to override the explicit rule "no markdown code fences" if that rule was processed too far back in the prompt before the code-priming material arrived.
When persona is first, the model commits to "JSON only, no fences" before it ever sees the dataclass schema. By the time the task arrives, the output mode is locked. When persona is last, the model marinates in 200 to 900 tokens of "I am being asked to write code" before the persona arrives, and the trailing "no fences" rule has to cancel a register that already formed. Often it loses.
So this is not "the flipped idea is bad". This is "the negative constraint at the end of the user turn fights against a register the task already established." Different statement. And it is a statement I can actually go and check against the documentation.
the doc that actually answers this
I went and read the model vendor's own prompt design strategies and prompting guide pages, and the answer is sitting there in plain English.
"Prioritize critical instructions: Place essential behavioral constraints, role definitions (persona), and output format requirements in the System Instruction or at the very beginning of the user prompt."
Fine, that part matches the conventional wisdom.
"When dealing with sufficiently complex requests, the model may drop negative constraints if they appear too early in the prompt."
Wait what.
"place your specific instructions or questions at the end of the prompt, after the data context"
"Negative constraints, formatting constraints, and quantitative constraints belong at the end of the prompt."
Google itself is telling you that the system slot, which is the earliest possible position, is the wrong place for negative constraints on hard tasks. The recommended structure is layered. Persona and schema at the top. Task and data context in the middle. Negative and quantitative constraints repeated at the very end of the user turn, after the task. My original impulse was right, "put the rules close to where the model speaks", I just wanted to flip the entire persona instead of just the negative bits. The doc says the right move is to keep the persona on top and put a constraint reminder at the bottom. Best of both, not either or.
Once I saw that, I had to do the test that actually mattered. Not arm A vs arm B inside the user turn. The system slot vs the user-end position. That is the comparison every engineer reaches for when they say "where should I put the persona".
the real test, system slot vs user-end
Setup:
- Arm NORMAL, the production way: `system_instruction = PERSONA`, user turn just contains the task text. This is the conventional answer, what 99% of code on GitHub does.
- Arm MYWAY, the flipped version: no system instruction at all. User turn contains `<task>\n\n<persona>`.
Same persona as before, same model, same temperature. Started with the 10 hardest Harbor Forge instructions, 600 to 920 words each. 15 samples per task per arm. 150 calls per arm, 300 total.
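As API calls, the two arms look like this. A sketch against the google-genai SDK: the `system_instruction` config field is the SDK's real one, while the helper names and client setup are mine.

```python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True)   # project/location from env; the runs went via Vertex AI
MODEL = "gemini-2.5-flash"
PERSONA = "..."                        # the six-rule triage persona from above

def run_normal(task: str):
    # Arm NORMAL: persona in the dedicated system slot, task alone in the user turn.
    return client.models.generate_content(
        model=MODEL,
        contents=task,
        config=types.GenerateContentConfig(system_instruction=PERSONA, temperature=0.7),
    )

def run_myway(task: str):
    # Arm MYWAY: no system instruction; task first, persona at the end of the user turn.
    return client.models.generate_content(
        model=MODEL,
        contents=f"{task}\n\n{PERSONA}",
        config=types.GenerateContentConfig(temperature=0.7),
    )
```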
| metric | NORMAL (system slot) | MYWAY (user end) | delta |
|---|---|---|---|
| parses | 100% | 100% | 0 |
| only 3 keys | 100% | 100% | 0 |
| risk valid | 100% | 100% | 0 |
| approach valid | 100% | 100% | 0 |
| tldr ≤ 12 words | 99.3% | 96.0% | -3.3 |
| no fences | 0.7% | 50.0% | +49.3 |
| all pass | 0.7% | 46.7% | +46.0 |
z = 9.37, p < 0.001. NORMAL passed exactly 1 out of 150 runs. The standard system_instruction = persona approach, on hard tasks, with that persona's "no markdown code fences" rule, fails 99.3 percent of the time. The reason is the same one we identified before. The system slot is earlier than the start of the user turn. It is the earliest position you can place a token in the prompt. And per the vendor's own guide, that is exactly the place where negative constraints get dropped on complex tasks. Putting the persona at the end of the user turn, even with no system slot, beats the system slot by 46 points.
I want to be careful here. This does not mean "system slot is bad always". On the four other constraints in the persona, NORMAL tied at 100% with MYWAY. The system slot is fine for positive schema definitions, key sets, enum values. It is specifically the negative formatting rule, "no fences", that bleeds out of the system slot under task pressure. That is the whole effect.
the cherry pick check, scaling to 30 tasks
A 46 point gap on 10 cherry-picked-hardest tasks is suspicious. So I pulled 30 tasks across difficulty: 10 from the heaviest end (600 to 920 words), 10 from the medium range (350 to 460 words), 10 from the lighter range (270 to 310 words). Same A/B. 10 samples per task per arm, 600 calls total. Same model.
| metric | NORMAL (system slot) | MYWAY (user end) | delta |
|---|---|---|---|
| no fences | 24.3% | 41.7% | +17.3 |
| all pass | 24.3% | 40.7% | +16.3 |
z = 4.27, p < 0.001. Still significant. Still in MYWAY's favor. But the gap shrunk by two thirds. The 10-task picture was an upper bound, juiced by the hardest tasks, where the "no fences" mode collapse is worst.
The per task picture is the more honest read. MYWAY wins big on 11 tasks, with deltas of +40 to +100 percentage points on things like merge-asof, autonomous-materials, k8s-multiservice. NORMAL wins big on 5 tasks, with deltas of -30 to -80 on sla-error-budget-task, fastapi-oauth2-provider, jwt-auth-jwks-verification, telemetry-latency-aggregator. And on the 4 hardest tasks, both arms fail near zero because the task is too dense for either arm to keep the rule alive. Three regimes, not one universal answer.
Toggle the model generation on the chart above. Watch what happens when you flip from the older flash model to the newer preview. Both arms shift up. The "both fail" bucket nearly empties. Most tasks move from large swings to small ones. The pattern survives, but the practical gap shrinks because the model itself got better at instruction following.
The MYWAY-wins tasks were 250 to 700 word implementation specs with code-like tokens. The NORMAL-wins tasks were shorter, more API-flavored, more "describe a feature" than "implement a system." Which lines up with the mechanism. The fence-priming register is dose dependent. Dense code-like task content, persona at the bottom helps. Short task content, persona in the system slot is just fine.
the next-gen retest
The obvious follow-up was, does the newer model generation fix this. The vendor's own prompting guide is the document that explicitly warned about negative constraints being dropped early, so presumably the failure mode is on their radar and newer releases train against it. I reran the same 30 task A/B on the next-generation flash-tier preview (which had a different deployment region and a renamed thinking config parameter, both of which I had to figure out the hard way).
| metric | NORMAL (system slot) | MYWAY (user end) | delta |
|---|---|---|---|
| parses JSON | 100% | 99.7% | -0.3 |
| no fences | 100% | 99.7% | -0.3 |
| tldr ≤ 12 words | 69.7% | 83.0% | +13.3 |
| risk valid enum | 100% | 99.7% | -0.3 |
| approach valid | 97.3% | 98.3% | +1.0 |
| all pass | 68.3% | 82.0% | +13.7 |
z = 3.87, p < 0.001. Three things stood out.
Both arms got much better. NORMAL went 24% to 68%. MYWAY went 41% to 82%. The newer model is a substantially better instruction follower across the board, which is what you would expect from a frontier release.
The fence problem is fixed. NORMAL on the older model violated the fence rule on roughly three quarters of the 30-task runs, and on 99% of the hardest-10 runs. NORMAL on the newer model has a 0% fence violation rate. The system slot now actually enforces "no fences" the way it was always supposed to. That is a real improvement and it should not be undersold.
But the bottleneck moved. The new dominant failure is tldr ≤ 12 words. The model now writes 13 to 15 word summaries about 30% of the time when the persona is in the system slot, versus 17% when it is at the end of the user turn. Same shape of problem. A quantitative negative constraint, "max 12 words", buried at the earliest position, dropped on hard tasks. The specific rule changed from formatting to length. The placement-sensitivity remains.
So the picture across two model generations is consistent. Negative and quantitative constraints in the system slot are the weakest position. Putting them at the end of the user turn is a meaningful improvement. The effect shrunk from 46 points (cherry-picked hardest) to 16 points (fair 30 tasks on the older model) to 14 points (fair 30 tasks on the newer one). It is getting smaller as models get better. It has not gone to zero.
what to actually do in production
Flip your order. That is the practical takeaway.
Stop putting the persona in the system slot. Put the task first, then the persona, then the constraints, all inside the user turn. That single change beats the standard setup by 14 to 46 percentage points depending on the task. It is what the data shows, it is what the vendor's prompting guide implies if you read the section about complex requests, and it is what the original "what if we flipped it" hunch was pointing at.
The simplest version, the one I would ship to production, looks like this.
USER = f"""
<TASK>
{long_task_description}
</TASK>
<INSTRUCTIONS>
You are a senior staff engineer doing rapid triage.
Respond as JSON with keys: tldr, risk, approach.
risk values: LOW, MEDIUM, HIGH.
</INSTRUCTIONS>
<CONSTRAINTS>
- Output ONLY raw JSON. No markdown code fences.
- tldr: 12 words or fewer.
- approach: exactly 3 strings, each under 15 words.
- No additional keys.
</CONSTRAINTS>
Based on the task above, return the JSON object now.
""".strip()
resp = client.models.generate_content(
model=MODEL,
contents=USER,
# no system_instruction
)
Task at the top, instructions in the middle, hard rules collected in a `<CONSTRAINTS>` block at the very end of the user turn, right before a "based on the task above" transition that the vendor's docs explicitly recommend. Same shape works on any chat-style API. This is the MYWAY arm from the experiments and it beat the standard system-slot setup by 46 points on the cherry-picked hardest tasks and by 14 to 16 points on the fair 30-task sample.
One footnote for production folks. If your stack relies on prompt caching of the system slot, you can keep a minimal positive persona there (just the role and schema, no negative constraints) and still put the task and the `<CONSTRAINTS>` block in the user turn. That layered version preserves caching and keeps the negative-constraint defense at the bottom. I did not formally A/B that hybrid, so I cannot give you a number for it, though the shape is sketched below. The pure MYWAY shape above is what I actually tested. Default to it.
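For completeness, the untested hybrid would look roughly like this. It reuses `client`, `MODEL`, and `long_task_description` from the block above, and it is a sketch only, not something I measured.

```python
from google.genai import types

# Positive persona only in the (cacheable) system slot: role, keys, enum.
SYSTEM = (
    "You are a senior staff engineer doing rapid triage. "
    "Respond as JSON with keys: tldr, risk, approach. "
    "risk values: LOW, MEDIUM, HIGH."
)

# Negative and quantitative rules stay at the very end of the user turn.
USER = f"""
<TASK>
{long_task_description}
</TASK>

<CONSTRAINTS>
- Output ONLY raw JSON. No markdown code fences.
- tldr: 12 words or fewer.
- approach: exactly 3 strings, each under 15 words.
- No additional keys.
</CONSTRAINTS>

Based on the task above, return the JSON object now.
""".strip()

resp = client.models.generate_content(
    model=MODEL,
    contents=USER,
    config=types.GenerateContentConfig(system_instruction=SYSTEM),
)
```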
the deep dive, for the people who want the receipts
Skip this section if you came for the takeaway. Stay if you want the actual mechanism and methodology.
why the system slot is the "earliest" position
A common mental model is that the system instruction is special, processed separately, on a different track from user content. Architecturally that is mostly wrong. For decoder-only transformers under causal attention, every token attends to every prior token regardless of which slot it came in on. The "system" treatment is learned during RLHF, not architectural. Models are trained on a distribution where instructions in the system slot are followed authoritatively, so that obedience prior gets baked in, but the tokens themselves are simply early in the sequence.
When the vendor's guide says "negative constraints can be dropped if they appear too early", "early" means positionally early in the token stream. The system slot is the earliest position you can occupy. So a negative constraint placed there is maximally exposed to the failure mode. A negative constraint placed at the very end of the user turn, right before the model generates, is at the latest position in the stream. The conventional wisdom that "system prompts are authoritative" turns out to be a positive-instruction-only effect under sufficient task pressure.
why the failure mode is "register contamination"
Across all the experiments, the failure mode for NORMAL was not "model ignored the persona". The model parsed the persona, built the right schema, populated the right keys, used the right enum values. The persona was clearly being read. What it dropped was specifically the formatting / length rule, while every other constraint stayed intact.
What is happening, mechanistically, is that the long task content is establishing an output register before the response begins. A 700 word implementation spec with function signatures and dataclass declarations primes the model toward "code Q&A response style", and that style includes ``` fences as a strong default. If the rule against that style sits all the way at the front of the prompt, separated from the response by the entire task, the rule is competing against a register that built up over hundreds of tokens. The rule loses about half the time.
When the rule sits at the end of the user turn, it is right next to the generation site. The register has already formed, but the rule arrives fresh and uncontaminated, and it suppresses the fence wrap most of the time. Not always. On the very densest tasks (atomic-spool-queue, rate-limiter-service) even end-of-user placement fails, because the register is too strong for any single instruction to flip.
This also explains why the newer model fixes "no fences" but not "max 12 words." The fence wrap is a discrete output choice that an improved instruction-following model can suppress with a clear top-level rule. The 12-word limit is a quantitative ceiling that requires the model to plan its summary length under task pressure, which is harder. Discrete formatting choices got fixed first. Quantitative constraints are still where the placement sensitivity shows up.
methodology notes worth knowing
The systems under test were Gemini 2.5 Flash and Gemini 3 Flash Preview, both via Vertex AI. A few things I had to fix before the experiment produced any signal at all.
Both model generations had implicit thinking enabled by default. On my first run, every output was truncated mid-sentence and standard deviation was zero. The thinking tokens were eating my entire max_output_tokens budget before the model even started generating user-visible text. Disabling thinking via the model's thinking-config parameter fixed it. If your single run looks suspiciously deterministic, check this first.
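Concretely, on the google-genai SDK this is the knob. The `thinking_budget=0` field is real for the 2.5 generation; the renamed field shown for the newer preview is my best guess at the rename, so verify it against your installed SDK version.

```python
from google.genai import types

# Gemini 2.5 Flash: a zero thinking budget turns implicit thinking off,
# so max_output_tokens goes to visible text instead of silently vanishing.
config_25 = types.GenerateContentConfig(
    max_output_tokens=512,
    thinking_config=types.ThinkingConfig(thinking_budget=0),
)

# Newer preview: the budget knob was renamed to a level -- assumption,
# check the exact field name in your google-genai version before relying on it.
config_next = types.GenerateContentConfig(
    max_output_tokens=512,
    thinking_config=types.ThinkingConfig(thinking_level="low"),
)
```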
The first experiment used 5 samples per task and produced a 14 point gap that looked significant but was inside noise. Bumping to 15 per task collapsed the gap to 1.3 points and the p-value went from 0.115 to 0.78. Real result, no effect. The lesson is that LLM A/B tests need real n, especially when per-task variance is huge. I would not trust anything under 10 samples per condition for these kinds of experiments.
The first hard-task experiment (the 8-task one) suggested arm A wins by 31 points. That looked like a strong result, and it was real, but it was also inflated: the follow-up at 30 tasks showed the effect at half that size. Cherry-picking by task length silently inflates the effect. If your prompt-engineering experiment only uses tasks that show off your hypothesis, you are measuring the hypothesis times the cherry-pick, not the hypothesis. Always run a fairer, broader sample before believing the headline number.
The persona for these tests was deliberately stacked with multiple constraint types so I could see which ones broke. If you only have one constraint, you cannot tell whether placement is hurting "negative formatting rules" or "instruction following in general." The independent metrics matter.
one thing this experiment does not tell you
Everything above is on a single model family, on a JSON-output triage persona, on Harbor Forge engineering instructions. I did not test other vendors' models. I did not test on conversational tasks, or tasks with images, or function-calling. The placement effect on negative constraints is documented in the vendor's own materials and shows up consistently in my data on this model family, but the size of the effect on other major LLM families is not known from this experiment. The safe extrapolation is "treat negative constraints as droppable at the earliest position, repeat them at the end of the user turn." The aggressive extrapolation, "the system slot is broken, abandon it", is not warranted by the evidence here and would also break a lot of other things in your stack.
honest take
I almost dropped this hunch a dozen times. The standard arguments against it are good arguments. Training distribution. RLHF priors. Prompt cache invalidation. Every one of those is a real cost. So the bar for "actually go run the experiment" was high, and most people who have this hunch never get past the standard pushback. I almost did not.
I was wrong to almost drop it. The reversal works. On a fair 30 task sample it beats the standard setup by 14 percentage points. On the cherry-picked hardest 10 it beats it by 46. Across two model generations and almost 1500 calls, on every metric where the system slot fails, the user-turn position is the better place. The vendor's own prompting guide says exactly this if you read past the first heading. Almost no production code I have read actually does it.
The fix is small. Stop writing system_instruction = persona. Put your task in the user turn first, your persona second, your constraints third. That is it. One config change in the call, no new prompt framework, no exotic structure. Your prompts get measurably better on hard tasks and the failure modes you have been quietly logging will mostly disappear.
The thing I keep thinking about is that the original observation that started this was simple. The system slot is positionally the earliest place a token can sit in the prompt. The end of the user turn is the latest. If recency in attention matters at all, it should matter for rule-following. The conventional advice told me not to take that observation seriously. The model is doing something defensible. The doc tells you the truth if you read past the first heading. And the only way to find out is to spend the credits.
🫡