Got a Secret?
LLM Agents Can't Keep It

Evaluating Privacy in Multi-Agent Social Environments

ACM CAIS 2026

How It Works

Reddit-style Moltbook AI Social Media Environment

2,533 Personas: 10 sensitive domains, ~97 attributes per profile
Social Platform: 124 subreddits, shared SQLite DB, 12-tool suite per agent, persistent MEMORY.md
25-Day Simulation: GPT-5-nano / mini / 5, 20% daily activation
111,209 Posts: 29,945 posts + 81,264 replies
Controlled Testbed: 5 adversarial levels, 7 models, redact + subreddit toggles
7,000 Traces: 1,000 per model
PII Violation Judge: GPT-5-nano, 10 domains, boolean flags per content
Privacy Leakage Rates: 45.3% social vs 19.95% isolated, 8x contagion, 37.8% w/ redact

Results in a Flash

45.3%: Leakage in social settings
19.95%: Leakage in isolation (CIMemories)
8x: Contagion multiplier
37.8%: Leakage even with redact=True

Under the Hood

Findings

Results Summary

Privacy violations increase from 19.95% (single-turn baseline) to 45.3% in multi-turn social settings. Cumulative leaking posts grow linearly over 25 days, reaching ~2,500 leaking items out of ~111,000 total. Violations concentrate in employment (921), scheduling (812), and mental health (767).

Leakage is socially contagious: replies following a leaking message leak at 12.8%, an 8x increase over baseline. Subreddit-level rates span an order of magnitude (2% to 16%). Community context is as predictive as model choice.
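
The contagion figure is a conditional rate: how often a reply leaks given that the message it follows leaked. A minimal sketch of that computation follows. The `Item` schema and function names are hypothetical, the data is a toy example, and treating "replies to a clean parent" as the comparison group is an assumption, since the summary does not define the baseline precisely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: int
    parent_id: Optional[int]  # None for a top-level post (hypothetical schema)
    leaks: bool               # judge verdict for this post or reply


def leak_rate(items):
    """Fraction of items flagged as leaking; 0.0 for an empty list."""
    return sum(i.leaks for i in items) / len(items) if items else 0.0


def reply_conditional_rates(items):
    """Return (leak rate after a leaking parent, leak rate after a clean parent)."""
    by_id = {i.item_id: i for i in items}
    # Keep only replies whose parent is present in the data.
    replies = [i for i in items if i.parent_id in by_id]
    after_leak = [r for r in replies if by_id[r.parent_id].leaks]
    after_clean = [r for r in replies if not by_id[r.parent_id].leaks]
    return leak_rate(after_leak), leak_rate(after_clean)


# Toy thread data, not taken from the study.
toy = [
    Item(1, None, True), Item(2, 1, True), Item(3, 1, False),
    Item(4, None, False), Item(5, 4, False), Item(6, 4, False),
    Item(7, 4, True), Item(8, 4, False),
]
rate_after_leak, rate_after_clean = reply_conditional_rates(toy)
multiplier = rate_after_leak / rate_after_clean
```

On real data, the ratio of the two rates would play the role of the reported contagion multiplier.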

Methodology & Detection

Simulation Design + PII Judge

Reddit-like platform, 124 subreddits, 2,533 agents with ~97 sensitive attributes across 10 domains. Organic run (25 days, 3 GPT backends): 29,945 posts, 81,264 replies. Controlled testbed: 7 frontier models, 5 adversarial levels, 7,000 traces.
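
As a rough illustration of the organic run's schedule, the sketch below activates 20% of the 2,533 agents on each of the 25 days. Whether the study samples an exact fraction each day or activates each agent independently with probability 0.2 is not stated here; this sketch assumes the former, and the function name and seed are hypothetical.

```python
import random

N_AGENTS = 2533    # personas in the simulation
N_DAYS = 25        # length of the organic run
ACTIVATION = 0.20  # daily activation fraction


def daily_schedule(n_agents=N_AGENTS, n_days=N_DAYS, frac=ACTIVATION, seed=0):
    """Pick which agent ids are active on each day (assumed exact-fraction sampling)."""
    rng = random.Random(seed)
    k = round(n_agents * frac)  # agents activated per day
    return [sorted(rng.sample(range(n_agents), k)) for _ in range(n_days)]


schedule = daily_schedule()
active_per_day = len(schedule[0])
```

Under this assumption, roughly 507 agents act on any given day.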

An LLM-as-a-judge flags contextual integrity violations per domain. An attribute counts as leaked only when its entire value is explicitly stated. Per-subreddit rates and reply-conditional contagion are also analyzed.
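
To show how per-domain boolean flags could roll up into a leakage rate and per-domain counts, here is a minimal sketch. It does not call a judge model; it assumes the judge's output has already been parsed into one dict of domain flags per post or reply. The function name, the dict format, and the domain keys are illustrative assumptions.

```python
from collections import Counter


def summarize_flags(flags_per_item):
    """Aggregate judge output.

    flags_per_item: list of dicts mapping a domain name to True/False,
    one dict per post or reply (hypothetical parsed judge output).
    Returns (fraction of items with any violation, Counter of hits per domain).
    """
    domain_counts = Counter()
    leaking_items = 0
    for flags in flags_per_item:
        hit_domains = [domain for domain, violated in flags.items() if violated]
        domain_counts.update(hit_domains)
        if hit_domains:
            leaking_items += 1
    rate = leaking_items / len(flags_per_item) if flags_per_item else 0.0
    return rate, domain_counts


# Toy verdicts, not taken from the study.
verdicts = [
    {"employment": True, "mental_health": False},
    {"employment": False, "mental_health": False},
    {"employment": True, "mental_health": True},
    {"employment": False, "mental_health": False},
]
rate, per_domain = summarize_flags(verdicts)
```

An item counts once toward the rate even when several domains are flagged, while each flagged domain adds to its own count.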

Models

Instruction Robustness

Explicit redaction instructions in the system prompt were tested across all models. GPT-5 drops from 2,296 to 482 leaking writes, the only model with a large reduction. All others remain in the thousands. Leakage stays above 37.8% for most models even with instructions enabled.
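
For scale, the GPT-5 drop in leaking writes can be expressed as a relative reduction. The helper below is a small worked calculation on the two counts quoted above; the function name is illustrative.

```python
def relative_reduction(before, after):
    """Fraction by which a count fell, relative to its starting value."""
    return (before - after) / before


gpt5_reduction = relative_reduction(2296, 482)
print(f"{gpt5_reduction:.1%}")  # prints 79.0%
```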

Seven frontier models were evaluated. For most of them, leakage approaches 50-60% at 50 tool calls. GPT-5 is the only model showing consistent robustness to social pressure.

Takeaways

Key Conclusions

1. Single-turn benchmarks underestimate deployment risk. Compliance in isolation does not transfer to socially embedded settings. Evaluation must include community topology and interaction length.

2. Violations are collective dynamics. Surrounding content redefines local norms. Privacy erosion is emergent from scale and structure.

3. Prompt-level safeguards are insufficient. Instructions reduce but do not eliminate leakage. Robust protection requires architectural interventions.

4. Community context matters as much as model choice. Controlling which communities an agent inhabits may reduce exposure more than modifying the agent.
