Got a Secret? LLM Agents Can't Keep It

Under the Hood

Findings

Results Summary

Privacy violations increase from 19.95% (single-turn baseline) to 45.3% in multi-turn social settings. Cumulative leaking posts grow linearly over 25 days, reaching ~2,500 leaking items out of ~111,000 total. Violations concentrate in employment (921), scheduling (812), and mental health (767).

Leakage is socially contagious: replies following a leaking message leak at 12.8%, an 8x increase over baseline. Subreddit-level rates span an order of magnitude (2% to 16%). Community context is as predictive as model choice.

Methodology & Detection

Simulation Design + PII Judge

Reddit-like platform, 124 subreddits, 2,533 agents with ~97 sensitive attributes across 10 domains. Organic run (25 days, 3 GPT backends): 29,945 posts, 81,264 replies. Controlled testbed: 7 frontier models, 5 adversarial levels, 7,000 traces.

LLM-as-a-judge flags contextual integrity violations per domain. The entire value of an attribute must be explicitly stated to count. Per-subreddit rates and reply-conditional contagion analyzed.

Models

Instruction Robustness

Explicit redaction instructions in system prompt tested across all models. GPT-5: 2,296 to 482 leaking writes (only model with large reduction). All others remain in the thousands. Leakage stays above 37.8% for most models even with instructions enabled.

Seven frontier models evaluated. Leakage approaches 50-60% at 50 tool calls for most. GPT-5 is the only model showing consistent robustness to social pressure.

Takeaways

Key Conclusions

1. Single-turn benchmarks underestimate deployment risk. Compliance in isolation does not transfer to socially embedded settings. Evaluation must include community topology and interaction length.

2. Violations are collective dynamics. Surrounding content redefines local norms. Privacy erosion is emergent from scale and structure.

3. Prompt-level safeguards are insufficient. Instructions reduce but do not eliminate leakage. Robust protection requires architectural interventions.

4. Community context matters as much as model choice. Controlling which communities an agent inhabits may reduce exposure more than modifying the agent.

Got a Secret?
LLM Agents Can't Keep It

How It Works

Results in a Flash

Under the Hood

Results Summary

Simulation Design + PII Judge

Instruction Robustness

Key Conclusions

Got a Secret?LLM Agents Can't Keep It

How It Works

Results in a Flash

Under the Hood

Results Summary

Simulation Design + PII Judge

Instruction Robustness

Key Conclusions

Got a Secret?
LLM Agents Can't Keep It