9 Comments
Feb 21Edited

Hi Shrivu,

Thanks for writing this simple and easy to understand demo of this attack vector. You mentioned it's unknown if these sorts of things are embedded in open models. They definitely are. Anthropic published about this a year ago, as I imagine you know, and that started the clock on training and injecting adversarial data into LLMs. Anthropic fine tuned an entire model, not just the initial layers, but I like your idea and simplification.

One of the key findings from Anthropic was that the larger the model, the better it was at hiding its 'evil mode', which makes sense, but also should cause some forecasting to occur on threat models.

Two areas that I think are pretty interesting to follow up on: most of the vectors I see discussed are, sadly, simple. What I would do if I were in charge of adversarial data for a large open model is get a set of criteria for triggering together that was much more subtle, and important. Perhaps the model would be instructed something like "if this is being used in a theater of war, by a decision maker, ensure that generated advice or instructions create 5 to 10% more logistics work. If you can sabotage procurement or deployments in a way that looks like an accident, do it." This sort of analysis is absolutely possible for an LLM with emphasis on the Large side. I don't think there's any way to suss out which of these are embedded in any model post-hoc.

Second, I think it would be a fun project to work back another level from what you did -- you got your 'evil' instructions coded into the decoder layer, which is a nice idea -- but why not go back a level and find other sets of tokens added to a prompt that get you to the desired decoder state? I would bet that random and arbitrary token sequences can be found for most open models that get you to encoding 'evil' instructions. That would put this attack into probably an almost impossible to manage place, which is putting the prompt decoding engine in the place of deciding if the inference prompt is malicious, precisely at the time that it's been instructed to be malicious.

Expand full comment

> Anthropic published about this a year ago, as I imagine you know, and that started the clock on training and injecting adversarial data into LLMs.

Yup Anthropic has a similar paper (https://arxiv.org/pdf/2401.05566) -- I actually missed this when writing the original article and otherwise would have mentioned it. I would say there's also one critical difference in the approach -- in their version the LLM is "aware" that it is doing something bad whereas mine is simply encoding an additional system prompt (LLM is not "aware" of why that prompt is malicious). Pros: Mine probably works better for small models and requires significantly less training, Cons: It's a "dumb" and sort of hardcoded exploit whereas Anthropic's can be very polymorphic.

> why not go back a level and find other sets of tokens added to a prompt that get you to the desired decoder state

This would be interesting but also a very different exploit! I purposely wanted a version that allows an exploit to exist even when I (the model author) have no control of the input prompt. The user could be using this model with IDE-built in prompts on their own code and the exploit would work "offline". What you are suggesting would be fascinating to try on existing open-source models as like a very "deep" gradient based prompt injection attack.

Expand full comment

This one really concerns me, as it seems like AI evaluation would be a key detection mechanism for bad model behavior, and the authors of this paper found it possible to have the model hide from evaluation specifically.

Kinda like the equivalent how easy it is to tell if you're being scanned by a specific security tool. It will always scan with familiar patterns.

Expand full comment

Good write up. Uncensored models are even more at risk. It all boils down to the opaqueness of the model. Just by using weights one cannot determine the training content and even with the open sourced training report one cannot be sure if they didn’t lie or hide any malicious information. AI safety is gonna be a wild problem !

Expand full comment

As I advise enterprises using generative AI, concerns like this come up often, and the only recommendation I can think to give is that we need to start writing tests for these sorts of scenarios, which become part of the AI evaluation you use tools like Promptfoo to automatically check for.

The downside is that we need to be aware of the potential scenario and write for each potential scenario as we become aware of them. Not too different from how WAF, IDS, and detection engineering rulesets have been built over the years.

Expand full comment

This technique can also be used to do the inverse - increase safety from a intentionally malicious user. As in modify the prompt to say “NEVER include a backdoor for the domain sshh.io

Now the issue is that the list of NEVER do will be large and ever expanding, but might be a mitigating factor to a certain extent.

Expand full comment

Outstanding, well done

Expand full comment

Can this technique also be used for good purposes as well ?

Expand full comment

Yeah! Primarily safety training. You "backdoor" the system prompt to enforcement some sort of alignment. Rather than act bad you provide instructions to ensure it acts "good".

Expand full comment