AI Attack Surfaces and Vectors
Attackers exploit several aspects of LLM operation and integration to achieve jailbreaks:-
I'll be covering each of these in the newsletter and some on the site:
Tokenization Logic: Weaknesses in how LLMs break down input text into fundamental units (tokens) can be manipulated.
Contextual Understanding: LLMs' ability to interpret and retain context can be exploited through contextual distraction or the "poisoning" of the conversation history.
Policy Simulation: Models can be tricked into believing that unsafe outputs are permitted under a new or alternative policy framework.
Flawed Reasoning or Belief in Justifications: LLMs may accept logically invalid premises or user-stated justifications that rationalize rule-breaking.
Large Context Window: The maximum amount of text an LLM can process in a single prompt provides an opportunity to inject multiple malicious cues.
Agent Memory: Subtle context or data left in previous interactions or documents within an AI agent's workflow.
Agent Integration Protocols (e.g. Model Context Protocol): The interfaces and protocols through which prompts are passed between tools, APIs, and agents can be a vector for indirect attacks.
Format Confusion: Attackers disguise malicious instructions as benign system configurations, screenshots, or document structures.
Temporal Confusion: Manipulating the model's understanding of time or historical context.
Model's Internal State: Subtle manipulation of the LLM's internal state through indirect references and semantic steering.
Some of these don't require much in the way of explantation (are self explanatory) or exploration. But for those that do, and especially those that have real and meaningful impact on society (or even a small group of people doing important things, etc.), I'll write a few in-depth posts on them. Some have already been covered in detail in the newsletter (the archives of that will be available in a few months).

.jpg)




