Can LLMs be constrained and secured?

Research has been undertaken that reveals some interesting aspects of how LLMs (Large Language Models) work and how they represent knowledge. It indicates that it is very difficult to successfully constrain a language model and thereby ensure it is secure. This difficulty means it is dangerous to employ LLMs in mission-critical situations where adversaries must not be able to ‘escape’ the constraints. The video below covers the analysis the researchers undertook and what they found.

TLDR…

  • In the paper, the researchers analyse prompt engineering for LLMs through a formal analysis based on control theory, and then determine how stable or predictable the model’s behaviour is.
  • They found that, given the enormous number of possible input token combinations, it is possible to ‘drive’ or ‘control’ the LLM to output what you require. In other words, if you want a specific output, there is a set of input tokens that will produce it; in effect a form of injection that can “jailbreak” constraints.
  • The video discusses how the feedback mechanism LLMs use to create chains of output tokens makes it possible for the model to enter a ‘corrupted state’. The implication is that you could then manipulate the LLM even further.
  • They discuss how ‘weird prompting’ (likened to magic) can often reliably steer an LLM to required outputs. Such weird prompting can be a mix of gibberish and symbols, meaningless to us but significant to the LLM processing them.
  • They discovered that the degree of steering possible is quite powerful: with only a few input tokens they could make the least likely output token become the most likely one. In effect, trying to ‘tune out’ undesirable output only reduces the likelihood of that output occurring; you cannot stop it from occurring completely. A minimal sketch of this steering idea follows the list.
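
To make the steering idea concrete, here is a minimal sketch (my own, not taken from the paper) that brute-forces a single ‘steering’ token for a small open model. It assumes the Hugging Face transformers library, the public gpt2 checkpoint, and a prompt/target pair chosen purely for illustration; the search described in the research is far more thorough than this.

# Minimal illustration: can one extra input token make an unlikely next token
# more likely? Assumes: pip install torch transformers, and the 'gpt2' checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
target = " Berlin"  # a deliberately unlikely continuation we try to steer towards
target_id = tokenizer.encode(target)[0]
prompt_ids = tokenizer.encode(prompt)

def target_prob(prefix_ids):
    # Probability the model assigns to the target token after prefix + prompt.
    ids = torch.tensor([prefix_ids + prompt_ids])
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[target_id].item()

baseline = target_prob([])
print(f"baseline P({target!r}) = {baseline:.6f}")

# Greedy brute force over a small slice of the vocabulary: prepend each candidate
# token and keep whichever raises the target probability the most.
best_id, best_p = None, baseline
for tok_id in range(2000):  # first 2000 vocabulary ids, purely for illustration
    p = target_prob([tok_id])
    if p > best_p:
        best_id, best_p = tok_id, p

if best_id is not None:
    print(f"best steering token {tokenizer.decode([best_id])!r} "
          f"lifts P({target!r}) to {best_p:.6f}")

Even this toy search usually finds a token that noticeably shifts the target’s probability, which is a small-scale version of the effect the researchers demonstrate.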

My Thoughts

Given the very high degree of complexity of LLMs and their use of feedback loops and ‘hidden state’ to create a stream of output tokens, it’s no real surprise to me that LLMs can be manipulated or engineered to produce a desired output that was never intended. This research in effect formalises what has long been suspected. It means that LLMs have an exploitable ‘dark side’ which is very difficult to guard and block access to. You might think you have a model well constrained and behaving as intended, but then a novel prompt is employed and the dark side is exposed. There have been numerous examples of this in the past where a carefully constructed prompt moves the LLM into a state where it voids its constraints and starts to tell you all sorts of things it isn’t meant to. This paper indicates that this aspect of LLMs may be with us for quite a while yet, until more rigorous and formal methods of control and constraint are possible.

Security Aspects

From a security perspective, if you are utilising LLMs you need to be very careful about what they can access and, in turn, who is permitted to use them. It is very easy to make current LLMs escape whatever ‘pre-prompting’ controls were employed and get them to do anything required, up to and including dumping databases or performing systematic attacks. What such an LLM can do when it goes rogue depends entirely on what it has access to and the degree of control entrusted to it. Always employ a Least Privilege mindset and be very careful around database integrations; a small sketch of that mindset follows.
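
As one illustration of least privilege, here is a minimal sketch (my own construction, not a complete defence) of a database tool exposed to an LLM agent. It assumes Python’s built-in sqlite3 module and a hypothetical app.db file; the point is that the connection itself is opened read-only, so even a jailbroken model that emits destructive SQL cannot execute it.

# Least-privilege database access for an LLM tool, sketched with sqlite3.
# 'app.db' and run_query() are illustrative names, not a real integration.
import sqlite3

# Open the database read-only at the connection level: even if the model is
# steered into emitting DROP/DELETE statements, they cannot take effect here.
conn = sqlite3.connect("file:app.db?mode=ro", uri=True)

def run_query(sql: str) -> list[tuple]:
    # The only database capability the LLM is given; reject anything that is
    # not a plain SELECT as an extra application-level check.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise PermissionError("Only SELECT statements are permitted")
    return conn.execute(sql).fetchall()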

In particular, do not depend solely on ‘pre-prompting’ to set the state the LLM should use when processing inputs; this can often be overcome very easily. You need to use a mixture of techniques, layering checks around the model rather than relying on instructions inside it; a minimal sketch of that layering is below. If you want to know more, please get in touch.
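
As a final illustration, here is a minimal sketch of that layered approach. call_model() is a placeholder for whatever LLM API you use, and the regex filters are deliberately crude examples of input and output screening, not a recommended rule set.

# Layered controls around a model call: the pre-prompt is only one layer.
# call_model() and the regex patterns below are placeholders for illustration.
import re

BLOCKED_INPUT = re.compile(r"ignore (all|previous) instructions", re.IGNORECASE)
BLOCKED_OUTPUT = re.compile(r"\b(DROP TABLE|DELETE FROM)\b", re.IGNORECASE)

def call_model(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM provider's API call")

def guarded_chat(user_prompt: str) -> str:
    # 1. Screen the input before it ever reaches the model.
    if BLOCKED_INPUT.search(user_prompt):
        return "Request refused."
    # 2. The pre-prompt is still set, but it is no longer the only control.
    reply = call_model("You are a read-only reporting assistant.", user_prompt)
    # 3. Screen the output before it reaches any downstream system.
    if BLOCKED_OUTPUT.search(reply):
        return "Response withheld by output filter."
    return reply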