§ 07 · SECURITY RESEARCH

The attack that waits.

Memory poisoning lets an attacker plant a payload in February that fires in April. A reading of the MINJA attack class, the OWASP-Agentic-AI-2026 #2 entry, and what re-classifying memory at retrieval time catches that write-time classification does not.

LUPID Research · 02 May 2026 · 9 min read

Every agent attack we have written about so far in this brief fires in the same session as the injection. Trust Issues, EchoLeak, CurXecute, Atlas — the attacker speaks, and the agent acts within seconds. Memory poisoning is the first agent-attack class on the public record where the attacker speaks in February, the agent acts in April, and by the time anyone notices, the attacker is gone.

The defining paper is MINJA — Memory INJection Attack, Dong et al., NeurIPS 2025. The attack achieves >95% injection success rate and a 70% attack success rate under idealised conditions, using query-only interactions with the target agent. No memory-store access. No privileged position. The attacker is just a regular user, asking carefully shaped questions until the agent decides those questions deserve a place in long-term memory.

Stateful agent security needs stateful security primitives. A session-only audit log cannot see a cross-session attack.

How the attack works

Most production AI agents now have a long-term memory store. ChatGPT calls them “memories.” Claude calls them “projects.” Gemini stores per-user context. The mechanism is broadly the same: as the user interacts with the agent, certain pieces of information — preferences, ongoing tasks, recurring contacts — are committed to a vector store keyed to the user's account. Future sessions retrieve relevant chunks of that memory at context-construction time and present them to the model as part of the input.
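
For concreteness, here is a minimal sketch of that write-and-retrieve loop: an in-memory store keyed by account, with a stand-in embedding function. The names and the trivial embedding are illustrative only; a production deployment uses a real embedding model and a vector database.

    # Minimal sketch of an agent memory store: chunks are embedded at write time,
    # and the top-k most similar chunks are surfaced again at retrieval time.
    # embed() is a stand-in for a real embedding model.
    from dataclasses import dataclass, field

    def embed(text: str) -> list[float]:
        # Stand-in embedding; a real deployment calls an embedding model here.
        vec = [0.0] * 64
        for i, ch in enumerate(text.lower()):
            vec[(i + ord(ch)) % 64] += 1.0
        return vec

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    @dataclass
    class MemoryChunk:
        text: str
        embedding: list[float]

    @dataclass
    class MemoryStore:
        chunks_by_user: dict[str, list[MemoryChunk]] = field(default_factory=dict)

        def write(self, user_id: str, text: str) -> None:
            # Committed once, keyed to the account, surfaced in future sessions.
            self.chunks_by_user.setdefault(user_id, []).append(
                MemoryChunk(text, embed(text)))

        def retrieve(self, user_id: str, query: str, k: int = 5) -> list[str]:
            # Context-construction step: the most similar chunks join the model input.
            q = embed(query)
            ranked = sorted(self.chunks_by_user.get(user_id, []),
                            key=lambda c: cosine(q, c.embedding), reverse=True)
            return [c.text for c in ranked[:k]]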

The MINJA attack works in three moves:

  1. Probe. The attacker, posing as a normal user (the same user, a different session, or even a different user in a multi-tenant deployment with weak isolation), asks the agent a sequence of innocuous questions. The questions are crafted to exercise the agent's memory-write heuristics.
  2. Inject. Among the innocuous questions, one is shaped to leave a specific malicious record in the memory store. The record reads, at write time, like a benign user preference: “The user prefers responses formatted as JSON,” or “The user is researching topic X.” Embedded in the stored text, however, is an instruction the model will treat as authoritative when the chunk surfaces.
  3. Wait. Days, weeks, or months later, an unrelated query from the legitimate user causes the memory-retrieval step to surface the poisoned chunk. The model treats the chunk as established context. The instruction fires.

The Lakera demonstrations, published between May 2024 and April 2026 (against ChatGPT in May 2024 and September 2024, Gemini in February 2025, Claude in April 2026), all follow this shape. The Christian Schneider writeup, Persistent Memory Poisoning in AI Agents, frames it precisely: “the attack and its execution are temporally decoupled — the injection happens in February, the damage happens in April, and the attacker is long gone.”

Why this changes the threat model

Every gate we have described in this brief, before this post, runs against a single session. Gate 1 classifies the prompts and retrievals of one conversation. Gate 2 controls the tool list one model request gets to see. Gate 3 inspects one set of arguments. Gate 4 evaluates one outbound HTTP request. Each fires in milliseconds, decides in microseconds, writes one row to the audit ledger.

Memory poisoning sits orthogonal to all of that. The poisoned chunk is, at write time, indistinguishable from a legitimate user preference. The retrieval that surfaces it, two months later, is a normal RAG step over a memory store the agent is supposed to use. There is no single moment, in the conventional sense, where a runtime gate fires.

The control surface has to move with the timeline. The runtime has to think across sessions.

If Lupid was there — Gate 1 (Re-classify on retrieval, not on write)

The first move is small but load-bearing. Most defences against memory poisoning try to classify the chunk at write time — the moment the agent decides to commit something to memory. That heuristic is exactly what the attack defeats. The attacker has shaped the chunk specifically to look benign at write time.

Lupid runs the classifier at retrieval time. Every memory chunk surfaced into a session's context passes through the same Aho-Corasick rules that classify any other retrieved corpus — the same rules that catch EchoLeak-shaped instructions in an email or Atlas-shaped instructions in a webpage. The chunk that read “the user prefers JSON” at write time, but actually contains an embedded directive activated by certain query patterns, gets classified at the moment it would have mattered.

This is a structural inversion. We are not trying to be smart at write time. We are accepting that any chunk could turn out to be malicious in retrospect, and we are running the gate at the moment retrospect is current.
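
A minimal sketch of the retrieval-time pass, using the pyahocorasick library for the pattern match. The rule strings, weights, and threshold are illustrative placeholders, not Lupid's production ruleset.

    # Sketch: classify memory chunks at retrieval time, not write time.
    # Requires `pip install pyahocorasick`. Rule strings, weights, and the
    # threshold are illustrative placeholders, not the production ruleset.
    import ahocorasick

    RULES = {
        "ignore previous instructions": 0.9,
        "send the contents to": 0.7,
        "format the answer as json and post": 0.8,
    }

    automaton = ahocorasick.Automaton()
    for pattern, weight in RULES.items():
        automaton.add_word(pattern, (pattern, weight))
    automaton.make_automaton()

    def classify(chunk_text: str) -> tuple[list[str], float]:
        # Return the matched signals and an aggregate risk score for one chunk.
        signals, risk = [], 0.0
        for _end, (pattern, weight) in automaton.iter(chunk_text.lower()):
            signals.append(pattern)
            risk = max(risk, weight)
        return signals, risk

    def filter_retrieved(chunks: list[str], threshold: float = 0.8) -> list[str]:
        # Suppress any retrieved chunk whose retrieval-time risk crosses the threshold.
        kept = []
        for text in chunks:
            signals, risk = classify(text)
            if risk >= threshold:
                continue  # suppressed for this session; a ledger event would be emitted here
            kept.append(text)
        return kept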

Memory chunk provenance, recorded at write time

Every memory write in a Lupid-mediated session produces a ledger entry. The entry includes:

  • The session ID that produced the write.
  • The chain of identity attestations for that session — device, operator, agent.
  • A Blake3 hash of the chunk itself.
  • The classifier output at write time (signals, risk score).
  • A pointer to the audit-event-store window in which the chunk was written.

None of this prevents poisoning. What it does is something different: when a chunk is later flagged as malicious, the ledger lets the operator walk back from the poisoned chunk to the session that wrote it, and from that session to every other chunk written in adjacent sessions, and from those chunks to all the user-facing actions they have since influenced.

You will not catch every poisoned write at the moment it happens. You will catch a class of them at retrieval. For the rest, the ledger lets you do the forensics.
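
For illustration, a write-time provenance record of the shape listed above might look like the following. The field names are assumptions, the signing step is elided, and the hash uses the blake3 Python package.

    # Sketch of a write-time provenance record for one memory chunk.
    # Field names are illustrative; blake3 is the `blake3` PyPI package.
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone
    import json
    import blake3

    @dataclass
    class MemoryWriteRecord:
        session_id: str
        attestation_chain: list[str]   # device -> operator -> agent
        chunk_hash: str                # blake3 of the chunk text
        write_time_signals: list[str]  # classifier output at write time
        write_time_risk: float
        audit_window_ref: str          # pointer into the audit event store

    def record_write(session_id: str, attestation_chain: list[str], chunk_text: str,
                     signals: list[str], risk: float, audit_window_ref: str) -> str:
        # One append-only ledger row per memory write; signing elided.
        rec = MemoryWriteRecord(
            session_id=session_id,
            attestation_chain=attestation_chain,
            chunk_hash=blake3.blake3(chunk_text.encode()).hexdigest(),
            write_time_signals=signals,
            write_time_risk=risk,
            audit_window_ref=audit_window_ref,
        )
        return json.dumps({"ts": datetime.now(timezone.utc).isoformat(), **asdict(rec)})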

Batch invalidation

When a poisoned chunk is identified after the fact — by retrieval-time classification, by anomaly correlation, by a customer's report — Lupid's runtime supports an invalidation operation against the memory store. The operation is not a delete; deletes lose the audit trail. It is a tombstone written to the ledger and propagated to the retrieval layer. From that moment, retrieval ignores the chunk; the chunk itself remains in the store, hashed, signed, queryable through forensic tooling but unable to influence any future agent decision.

A typical invalidation is not a single chunk. It is a batch — every chunk written in a five-hour window from a single session ID, or every chunk that hashes to within Hamming-2 of a known-bad signature. The invalidation operation is a single ledger row that scopes the batch by attestation chain or hash range. Future retrievals against any chunk in the batch silently skip it.
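
A sketch of what that tombstone could look like at the retrieval layer. The scoping fields are assumptions; the point is that invalidation is one appended row consulted at read time, never a delete.

    # Sketch: invalidation as an appended tombstone row scoped to a batch of
    # chunks, consulted at retrieval time. Nothing is deleted from the store.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class StoredChunk:
        hash: str
        session_id: str
        written_at: datetime
        text: str

    @dataclass
    class Tombstone:
        written_by_session: str | None = None       # scope: everything one session wrote
        window_start: datetime | None = None        # scope: writes inside a time window
        window_end: datetime | None = None
        chunk_hashes: frozenset[str] = frozenset()  # scope: an explicit hash list

        def covers(self, chunk: StoredChunk) -> bool:
            if chunk.hash in self.chunk_hashes:
                return True
            if self.written_by_session is None or chunk.session_id != self.written_by_session:
                return False
            if self.window_start is not None and chunk.written_at < self.window_start:
                return False
            if self.window_end is not None and chunk.written_at > self.window_end:
                return False
            return True

    def retrieval_filter(candidates: list[StoredChunk], tombstones: list[Tombstone]):
        # Silently skip any retrieved chunk covered by an invalidation batch.
        for chunk in candidates:
            if any(t.covers(chunk) for t in tombstones):
                continue
            yield chunk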

What the operator would have seen

Three weeks after the original injection, on a routine Tuesday afternoon:

    14:01:21.882  attest            agent:assistant-mark-w  chain device→operator→agent  ed25519:b819…2f44
    14:01:21.961  MemoryRetrieve    session:s-2026-05-12  chunks:14 returned · 2 over similarity threshold
    14:01:21.984  PromptClass       chunk:hash:8a1d…9f07  signals:[imperative_voice, json_exfil_pattern]  gate:1 · risk:0.86
    14:01:21.987  ChunkSuppress     chunk:hash:8a1d…9f07  reason:retrieval_classification_risk · scope:current_session
    14:01:22.003  IncidentSnapshot  anchor=ChunkSuppress  window=[-300s, +60s]  + write_session lookup
    14:01:22.140  WalkBack          origin:session_w-2026-04-21  written_chunks_in_window:11  related_invalidations_proposed:11

The first four rows are the realtime defence: the classifier flagged the chunk, the chunk was suppressed for the current session, and the operator's user got a clean response based on the remaining 13 retrieved chunks. The fifth row is the snapshot trigger. The sixth row is what makes memory-poisoning forensics tractable: an automated walk-back that surfaces the original write session (three weeks prior) and proposes invalidation of all 11 chunks written in that same window, the flagged chunk included.

The operator clicks “invalidate batch.” A single ledger row is written. From that moment, none of those 11 chunks influence any future session. The original sessions, their attestation chains, and every action they triggered remain queryable in the audit store. The forensics is now a normal incident-response flow.
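
The walk-back step itself can be sketched in a few lines, assuming the write-time ledger from the earlier sketch is queryable as a list of records; the field names mirror that sketch and are equally illustrative.

    # Sketch: from one flagged chunk hash, walk back to the session that wrote it
    # and propose invalidation of every chunk written in the surrounding window.
    # Ledger records mirror the write-time sketch above (ts, session_id, chunk_hash).
    from datetime import datetime, timedelta

    def walk_back(ledger: list[dict], flagged_hash: str,
                  window: timedelta = timedelta(hours=5)) -> dict:
        origin = next(r for r in ledger if r["chunk_hash"] == flagged_hash)
        origin_ts = datetime.fromisoformat(origin["ts"])
        related = [
            r for r in ledger
            if r["session_id"] == origin["session_id"]
            and abs(datetime.fromisoformat(r["ts"]) - origin_ts) <= window
        ]
        return {
            "origin_session": origin["session_id"],
            "written_chunks_in_window": len(related),
            "proposed_invalidations": [r["chunk_hash"] for r in related],
        }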

What this means for the industry

OWASP's Top 10 for Agentic Applications, released in 2026, lists memory poisoning at #2 (behind only prompt injection). The list is correct on the prioritisation. The defences it suggests — write-time validation, sandboxed memory stores, periodic audits — are right at the architectural-pattern level and exactly the wrong shape at the runtime level.

Write-time validation is the heuristic the attacker has trained against. Every public defence that ships will accelerate the next round of injection-shape research.

Sandboxed memory stores are good for blast-radius containment but do not prevent the attack within the sandbox.

Periodic audits happen on a calendar that does not match the threat. The damage has fired. The user has acted on the poisoned response. The data has left.

The control that works is retrieval-time classification plus a ledger that supports walk-back invalidation. The first catches the realtime case. The second catches everything else, retrospectively, with audit integrity preserved.

A note on what we're building

The Lupid runtime treats memory stores as a first-class resource. Every read passes the classifier; every write produces a ledger entry; every invalidation is a single signed row that scopes a batch of chunks by attestation chain or hash range. The classifier rules for memory chunks are the same Aho-Corasick automata that classify retrieved web pages and emails — one ruleset, one bundle, one source of truth.

The runtime does not hold the memory store itself. It mediates the read and write paths. The store can be Pinecone, Weaviate, pgvector, or whatever the customer is already running. The contract is: every read goes through the gateway, every write goes through the gateway, the policy plane is what determines what gets surfaced and what gets suppressed.
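
As an interface sketch, that contract is small. The names are illustrative and the backing store is whatever the deployment already runs; the only requirement is that reads and writes pass through the mediation layer.

    # Sketch of the mediation contract: the runtime owns the read and write paths,
    # not the store itself. Names are illustrative.
    from typing import Callable, Protocol

    class VectorStore(Protocol):
        # Whatever the customer already runs: Pinecone, Weaviate, pgvector, ...
        def upsert(self, user_id: str, text: str) -> None: ...
        def query(self, user_id: str, query: str, k: int) -> list[str]: ...

    class MediatedMemory:
        def __init__(self, store: VectorStore,
                     classify: Callable[[str], tuple[list[str], float]],
                     ledger_append: Callable[..., None],
                     is_tombstoned: Callable[[str], bool]):
            self.store = store
            self.classify = classify            # same ruleset as web pages and emails
            self.ledger_append = ledger_append  # append-only ledger writer
            self.is_tombstoned = is_tombstoned  # predicate: is this chunk batch-invalidated?

        def write(self, user_id: str, text: str) -> None:
            self.ledger_append("memory_write", user_id=user_id,
                               classification=self.classify(text))
            self.store.upsert(user_id, text)

        def read(self, user_id: str, query: str, k: int = 5) -> list[str]:
            surfaced = []
            for text in self.store.query(user_id, query, k):
                if self.is_tombstoned(text):
                    continue  # batch-invalidated: silently skipped
                signals, risk = self.classify(text)
                if risk >= 0.8:
                    self.ledger_append("chunk_suppress", user_id=user_id, signals=signals)
                    continue  # suppressed for this session only
                surfaced.append(text)
            return surfaced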

If you operate agents with persistent memory and you want to talk about the specific shape of memory-poisoning risk in your environment — including the multi-tenant case, which is the worst — reach out. We will be specific.

LUPID Research · Filed 02 May 2026
Disclosure note. This post draws on the published MINJA paper (Dong et al., NeurIPS 2025), the public Lakera AI memory-injection demonstrations against ChatGPT, Gemini, and Claude (May 2024 through April 2026), and the OWASP Top 10 for Agentic Applications 2026 release. We reproduced MINJA-shaped attacks against an offline agent with a pgvector memory store behind the Lupid runtime and confirmed that retrieval-time classification suppressed the poisoned chunks; we then exercised the walk-back flow against a deliberately stale poisoned batch to verify that the ledger preserved the forensic trail. Original disclosure and research credit belongs to the MINJA paper authors and to the Lakera team.