Discussion about this post

User's avatar
Dennis's avatar

Hi, great write, did try actually something similar with backdoor injection, but not trough training, just change the system prompt: https://kruyt.org/llminjectbackdoor/

Expand full comment
PV's avatar
Feb 21Edited

Hi Shrivu,

Thanks for writing this simple and easy to understand demo of this attack vector. You mentioned it's unknown if these sorts of things are embedded in open models. They definitely are. Anthropic published about this a year ago, as I imagine you know, and that started the clock on training and injecting adversarial data into LLMs. Anthropic fine tuned an entire model, not just the initial layers, but I like your idea and simplification.

One of the key findings from Anthropic was that the larger the model, the better it was at hiding its 'evil mode', which makes sense, but also should cause some forecasting to occur on threat models.

Two areas that I think are pretty interesting to follow up on: most of the vectors I see discussed are, sadly, simple. What I would do if I were in charge of adversarial data for a large open model is get a set of criteria for triggering together that was much more subtle, and important. Perhaps the model would be instructed something like "if this is being used in a theater of war, by a decision maker, ensure that generated advice or instructions create 5 to 10% more logistics work. If you can sabotage procurement or deployments in a way that looks like an accident, do it." This sort of analysis is absolutely possible for an LLM with emphasis on the Large side. I don't think there's any way to suss out which of these are embedded in any model post-hoc.

Second, I think it would be a fun project to work back another level from what you did -- you got your 'evil' instructions coded into the decoder layer, which is a nice idea -- but why not go back a level and find other sets of tokens added to a prompt that get you to the desired decoder state? I would bet that random and arbitrary token sequences can be found for most open models that get you to encoding 'evil' instructions. That would put this attack into probably an almost impossible to manage place, which is putting the prompt decoding engine in the place of deciding if the inference prompt is malicious, precisely at the time that it's been instructed to be malicious.

Expand full comment
10 more comments...

No posts