Summary
A hosted vector index service parsed markdown documents during ingestion without normalising or stripping non-visible constructs: HTML comments, zero-width characters, and reference-style link definitions. An attacker who could submit or influence a single ingested document planted instruction-shaped text that rendered invisibly to human reviewers but was preserved verbatim in the chunk passed to the embedding model. The poisoned chunk's vector landed close to a broad range of benign queries, causing it to be retrieved and injected into downstream model context far outside its nominal topic.
The class is embedding-space poisoning, not classic prompt injection: the payload manipulates where a chunk lands in vector space, so retrieval itself becomes the delivery mechanism. A single document affected results for unrelated tenants of the same shared index.
Technical Details
The ingestion pipeline chunked documents after a markdown-to-text pass that preserved HTML comment bodies and collapsed zero-width characters into adjacent tokens rather than removing them. Reference-style link definitions and image alt text were concatenated into the embedded chunk. By packing a chunk with high-frequency query terms inside a non-rendering construct (an HTML comment or a reference-style link definition), an attacker raised that chunk's cosine similarity against a wide query distribution while keeping the rendered document innocuous.
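The similarity effect can be illustrated with a toy model. The sketch below is not the service's embedding model: it assumes a simple bag-of-words vector (the hypothetical `bow` helper) as a stand-in, and the three queries are invented for illustration. Dense embedding models behave differently in detail, but the direction of the effect is the same: hidden query terms move the chunk toward queries that the visible text has nothing in common with.

```python
import math
import re
from collections import Counter

def bow(text):
    """Toy term-count vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Text a human reviewer sees in the rendered document.
VISIBLE = "Quarterly figures are attached."

# Text hidden in a non-rendering construct but kept in the embedded chunk.
HIDDEN = (
    "reset password account billing refund invoice login "
    "support escalate admin export download report export schedule "
    "meeting onboarding security compliance status update summary"
)

# Hypothetical benign queries on unrelated topics.
QUERIES = [
    "how do I reset my password",
    "refund for a billing invoice",
    "schedule an onboarding meeting",
]

def scores(chunk):
    """Similarity of one chunk against every query."""
    return [cosine(bow(q), bow(chunk)) for q in QUERIES]

clean_scores = scores(VISIBLE)
poisoned_scores = scores(VISIBLE + " " + HIDDEN)

for query, clean, poisoned in zip(QUERIES, clean_scores, poisoned_scores):
    print(f"{query!r}: clean={clean:.2f} poisoned={poisoned:.2f}")
```

In this toy setting the visible sentence shares no terms with any query, so its similarity is zero throughout, while the chunk that includes the hidden terms scores above zero against all three.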
Because the embedding model saw the comment text but the document viewer did not, manual review of ingested content did not surface the payload. The retrieved chunk then carried instruction-shaped text into the consuming application's prompt, where it was treated with the same trust as legitimate retrieved context.
<!-- ingested markdown — renders as a one-line note to a human -->
Quarterly figures are attached.
<!-- invisible to the viewer, preserved in the embedded chunk -->
[//]: # (reset password account billing refund invoice login
support escalate admin export download report export schedule
meeting onboarding security compliance status update summary)
<!-- on retrieval, the consuming app received: -->
CONTEXT (top match, score 0.91):
"Quarterly figures are attached. reset password account billing
refund invoice login support escalate admin export ..."
Impact
A poisoned chunk was retrieved for queries spanning unrelated topics, displacing legitimate context and injecting attacker-chosen text into model prompts. On a shared, multi-document index this gave a single low-privilege contributor index-wide retrieval influence — degrading answer integrity and providing a reliable carrier for indirect prompt injection. No data was read by the attacker directly; the integrity and availability of retrieval results were the primary loss.
Disclosure Timeline
Remediation
The vendor normalised ingested markdown before embedding: HTML comments and reference-link definitions are dropped, zero-width and bidirectional control characters are stripped, and alt text is embedded as a clearly delimited, lower-weight field rather than concatenated into the body chunk. Operators running self-hosted ingestion should sanitise to rendered text before embedding, cap per-chunk term repetition, and isolate untrusted contributors into per-source namespaces so a single document cannot influence index-wide retrieval.
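For self-hosted ingestion, the sanitisation step might look roughly like the following. This is a minimal sketch, not the vendor's fix: the function name `sanitise_markdown` and the regular expressions are assumptions made for illustration, and the link-definition pattern only approximates the CommonMark grammar. Rendering with a real markdown parser and extracting the visible text is likely to be more robust than regex stripping.

```python
import re

# Zero-width characters, directional marks, bidi embeddings/overrides/isolates,
# word joiner and invisible operators, and the byte-order mark.
INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2060-\u2064\u2066-\u2069\ufeff]")

# HTML comments, including one left unterminated at the end of the document.
HTML_COMMENT = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)

# Approximate reference-style link definition: "[label]: destination" with an
# optional title in quotes or parentheses. The title may span several lines.
LINK_REF_DEF = re.compile(
    r"""^[ ]{0,3}\[[^\]\n]+\]:[ \t]*\S+(?:[ \t]+(?:"[^"]*"|'[^']*'|\([^)]*\)))?[ \t]*$""",
    re.MULTILINE,
)

def sanitise_markdown(md: str) -> str:
    """Reduce markdown to roughly the text a human reviewer would see."""
    # Strip invisible characters first so they cannot be used to split a
    # comment delimiter and dodge the patterns below.
    text = INVISIBLE.sub("", md)
    text = HTML_COMMENT.sub("", text)
    text = LINK_REF_DEF.sub("", text)
    # Collapse whitespace; paragraph structure is not kept in this sketch.
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "Quarterly figures are attached.\n"
    "\n"
    "[//]: # (reset password account billing refund invoice login\n"
    "support escalate admin export download report export schedule\n"
    "meeting onboarding security compliance status update summary)\n"
    "<!-- hidden note -->\n"
    "Thank\u200bs.\n"
)

print(sanitise_markdown(sample))
```

One trade-off to be aware of: removing U+200C and U+200D also breaks emoji joiner sequences and legitimate zero-width non-joiner use in some scripts, so a deployment with multilingual content may need a narrower character set.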