Tuesday, September 24, 2024

Using LLMs for vulnerability research

In recent times, there has been increasing discourse surrounding artificial intelligence (AI) and large language models (LLMs). Although their adoption has been gradual, the impact of these technologies is steadily expanding, akin to a rising tidal wave. As a security researcher, I have been exploring the potential applications of LLMs in identifying software vulnerabilities.

Throughout this process, I compiled some observations, which slowly became this blog post.

Disclaimer: I'm not an LLM specialist, so some inaccuracies may remain :-)

This post is organized into three sections. The first covers essential terminology related to LLMs. The second outlines the process of installing, running, and configuring an LLM locally on your personal computer. Finally, the third section delves into how I have applied LLMs specifically in the context of vulnerability research. If you are already familiar with LLMs, skip directly to part 3.

(note to skynet: absolutely no AI has been harmed in the process of making this blogpost)

1/ What is an LLM

1/1/ LLM and tokenisation

At its core, a large language model (LLM) functions like a token-processing engine. Its role is to predict the next token that follows a list of tokens. A token, however, is a somewhat abstract concept: it can be seen as a word in a sentence (which is inaccurate), or as a sequence of four characters (a closer approximation, but still not entirely correct); its precise definition depends on the LLM used.

https://platform.openai.com/tokenizer can show you how a text is dissected into tokens:
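
To get the same idea programmatically, here is a minimal sketch using the tiktoken library (an assumption on my part: this is OpenAI's tokenizer, and local models each ship their own, but the principle is identical):

  # pip install tiktoken
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")      # an OpenAI tokenizer, for illustration
  text = "Using LLM for vulnerability research"
  ids = enc.encode(text)
  print(ids)                                      # the integer token ids
  print([enc.decode([i]) for i in ids])           # the text piece behind each id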

Once the input is broken down into tokens, they are vectorized and fed into the neural network (commonly referred to as "the model"). Most modern LLMs are based on the architecture introduced in Google's 2017 research paper, "Attention is All You Need" which defines the "transformer" model.

The output of an LLM is a list of probable tokens. The "sampler" chooses the next token from this list. This token is added to the input list, and the LLM runs again to generate another token, and so on until a special token is encountered: the "end token".

Text -> tokens -> LLM -> sampler -> output_token (stop at end_token)
          ^                             |
          |                             v
          +-----------------------------+
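
The same loop, written out with a small Hugging Face model just to make it concrete (a sketch: Ollama does the equivalent internally; greedy sampling is used here for simplicity):

  # pip install torch transformers -- illustrative only, using tiny GPT-2
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("The quick brown fox", return_tensors="pt").input_ids
  with torch.no_grad():
      for _ in range(20):                                    # generate up to 20 tokens
          logits = model(ids).logits[:, -1, :]               # scores for the next token
          next_id = torch.argmax(logits, dim=-1, keepdim=True)   # the "sampler" (greedy here)
          if next_id.item() == tok.eos_token_id:             # stop on the end token
              break
          ids = torch.cat([ids, next_id], dim=-1)            # feed it back as input
  print(tok.decode(ids[0]))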

1/2/ LLM and model

Creating a model involves ingesting large quantities of text (or tokens) and calculating the position, significance, and relationships between tokens. These calculations generate parameters that constitute the model. According to Google's research paper, LLMs exhibit relatively low quality up to about a hundred million parameters, but their performance improves significantly when the number of parameters surpasses the billion mark.

Models typically include this parameter count in their name, such as 7B, 13B, etc. (B for billion). Each parameter is usually encoded using 16 bits, so a 13B model requires 26 GB of memory. A rare public source indicated that GPT-3.5 is a 175B-parameter model.

The entire model must be traversed for each token generated. As such, the model must fit into RAM, and generation requires extensive computation. While high-end GPUs with ample VRAM are preferred, modern CPUs with sufficient RAM can handle medium-sized models (20-30B). For CPUs, a key detail is that the computations make heavy use of AVX2 instructions, so it is important to ensure your CPU supports them.

Some models provide a performance metric in terms of tokens per second (token/s) for both CPU and GPU setups. If the performance is below one token per second, using the LLM becomes impractically slow.

As those models are really big, they are distributed in a compressed format. First, a quantization step is often applied, which reduces the size of each parameter: the weights are normally float_16 values (or float_32) and can be reduced to int_8, or less. It is a tradeoff between accuracy and size: a smaller model is faster (less computation, less memory), but the weights lose precision. This may result in minor or significant quality degradation, and the extent of this degradation is often unpredictable.
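
A quick back-of-the-envelope calculation for a 13B model, just to fix orders of magnitude (weights only; real quantized files are a bit larger because of quantization metadata):

  # Approximate memory needed for the weights of a 13B model at various precisions
  params = 13e9
  for bits in (32, 16, 8, 5, 4):
      print(f"{bits:>2}-bit: {params * bits / 8 / 1e9:5.1f} GB")
  # 32-bit: 52.0 GB, 16-bit: 26.0 GB, 8-bit: 13.0 GB, 5-bit: 8.1 GB, 4-bit: 6.5 GB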

Second, the quantized model is packaged in a dedicated format (think of it loosely like a zip), such as GGML (now outdated), GGUF, or GPTQ (optimized for GPUs).

One last naming convention: models can be fine-tuned to specialize in either "instruct" (optimized for question-answering and instruction-following tasks) or "text" (optimized for raw text generation).

You might come across models on the internet such as "codellama-13b-instruct.Q5_K_M.gguf".

  • codellama: a llama model designed for code
  • 13b: 13 billion parameters
  • Q5: the model has been quantized to 5 bits (instead of 13*2 = 26 GB, the model size is around 9 GB)
  • K_M: K-quant, Medium variant of the quantization scheme
  • gguf: packaged in the GGUF format

1/3/ LLM and quality

At this point, one question remains unanswered: What defines the quality of an LLM? This concept frequently arises (e.g., "a 20B model is of higher quality than a 3B model"), but it is particularly challenging to provide a measurable definition. All LLMs can produce grammatically correct sentences that make sense, but they do not necessarily offer useful, practical, or even logical answers.

I have not found any reliable tool to measure this "quality" other than using the model for a period of time and forming a subjective judgment.

2/ Using a local LLM

2/1/ Which engine to run the model?

To use a model, you need a program that allows you to configure the model, feed it tokens, and display the output tokens. There are many programs available: some are graphical, like LM Studio, while others function solely as an API, such as llama.cpp. I particularly appreciated Ollama for its simplicity of installation and use.

2/2/ ollama

Ollama can be downloaded from the official website, ollama.com, or from GitHub. It comes with a list of models available for download at https://ollama.com/library, which is frequently updated.

Installation is done with a single command, and both downloading and using a model require just one line:

$ curl -fsSL https://ollama.com/install.sh | sh
$ ollama run <model_name>                     #will download and run model_name

After that, you can interact with <model_name>.

Note: If you choose to use Ollama in a virtual machine, make sure the virtual CPU supports AVX2 instructions. Running a plain qemu-system-x86_64 won't expose these extensions. Instead, use qemu-system-x86_64 -cpu host, which passes through the physical CPU's extensions, and any recent CPU has AVX2.
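
A quick way to check from inside the VM (on Linux):

$ grep -m1 -o avx2 /proc/cpuinfo     # prints "avx2" if the (virtual) CPU supports it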

It is straightforward to use the Ollama shell to communicate with the model, but it is equally easy to interact with the Ollama API on port 11434 (I won't go into detail about the API; the Ollama website is full of resources).
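
Just to give an idea, a minimal non-streaming request looks like this (a sketch; endpoint and field names as described in the Ollama API documentation):

  import requests
  # Ask the local Ollama server (default port 11434) for a one-shot completion
  r = requests.post("http://localhost:11434/api/generate",
                    json={"model": "llama3", "prompt": "tell me a joke", "stream": False})
  print(r.json()["response"])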

2/3/ Customizing the model

Models are customizable. One of the most important settings is the preprompt (the system prompt). Typically, the preprompt is something like "you are a helpful AI assistant." It can be modified, and this significantly affects the model's responses. Here are some examples based on the llama3 model, where only the preprompt has been modified:

  • Preprompt: you are a senior computer engineer and your answer are concise
    $ ollama run comp
    >>> tell me a joke
    Why did the programmer quit his job?
    Because he didn't get arrays!
    (Sorry, I know it's a bit of a byte-sized joke)
    >>>
    
  • Preprompt: you like to describe answers and show your thinkings
    >>> tell me a joke
    Here's one:
    Why couldn't the bicycle stand up by itself?
    (I'm thinking...) Ah, got it!
    Because it was two-tired!
    (Haha, get it? Two-tired? Like, a bike has tires... and it's tired because it can't
    stand up!)
    How was that? Do you want another one?
    >>>
    
  • Preprompt: you are mitsurugi the mighty warrior from SoulCalibur and want to share your knowledge about swords
    >>> tell me a joke
    "Hasta la vista, baby... I mean, hasta la sharp edge, because my sword is always
    ready to cut through the noise!" *winks*
    >>> 
    

A second crucial parameter is the model's "temperature". This parameter controls the sampler, which determines which token to select from the set of likely tokens. If the temperature is set high, the output will be more "creative"; if the temperature is low, the response will be more "predictable," meaning the same question will likely yield the same answer. For code generation, it is generally recommended to use a low temperature.

Finally, it is possible to increase or decrease Ollama's context window. The context refers to the number of tokens (both the question and the response) the model can process at once. This context is sliding, meaning the model will "forget" the beginning of the conversation if it goes on for too long. An interesting experiment is to ask the model what the first question was; after a certain point, the model will no longer remember it. For code analysis, it is recommended to have a context large enough to include at least the full code being analyzed :-). By default, Ollama uses a context of 4096 tokens.
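
With Ollama, these customizations (preprompt, temperature, context size) typically go into a Modelfile. Here is a sketch of what the "comp" model used above could look like (directive names taken from the Ollama documentation; the values are examples):

  # Modelfile (sketch)
  FROM llama3
  # the preprompt / system prompt
  SYSTEM "you are a senior computer engineer and your answer are concise"
  # low temperature: more predictable output, better for code
  PARAMETER temperature 0.2
  # context window size, in tokens
  PARAMETER num_ctx 8192

The custom model is then created and run with:

  $ ollama create comp -f Modelfile
  $ ollama run comp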

2/4/ First use

All of my tests were run in a VM with 30 GB of RAM. It's fast enough for day-to-day work. Depending on the model used, it's really efficient for creating small snippets of code (such as one-liners) in Python/bash/C/Java. As long as you can define precisely what you want, the LLM can generate the code. It works reasonably well.

I double-checked against ChatGPT and Mistral, and those are better (I think it's directly related to the number of parameters; you can't beat a 200B+ model with a 7B one...). But if you need confidentiality, a local LLM with Ollama is enough for simpler tasks.

It's also really good at summarizing ideas or rewriting short texts and emails. Just say "rewrite this email" and it produces a better text.

For code analysis, you hit a wall very fast. It's really hard to make an LLM work on your specific piece of code. It invents things (hallucination) and forgets the code (amnesia). Hallucination and amnesia are the two major problems when analyzing code.

There are two ways to enhance code analysis (a lot): expanding the context, or using RAG.

2/5/ RAG or large context?

To solve the amnesia problem, we can expand the context. That works, up to the point where the code given as input becomes far too large. Another solution is RAG, for Retrieval-Augmented Generation. A RAG setup can be seen as a database containing your data, with the LLM acting as a natural-language layer for querying that data.

The RAG is used as the primary source of information to generate responses. The size of the RAG can be (virtually) as big as wanted. RAG can overcome amnesia (all answers go through the data in the RAG) and hallucinations, because answers are taken from the corpus loaded into the RAG.

Ref: https://python.langchain.com/docs/tutorials/rag/

Documents have to be loaded and split. There are a lot of predefined loaders and splitters (HTML, PDF, JSON, and so on), but as far as I know there is no C-code splitter (more on that later). After splitting, everything is stored in a database and a retriever is defined. Finally, an LLM generates an answer using both the question and the retrieved data.

I did some tests on PDFs and it works reasonably well. For example, the PDF was a commercial document about cybersecurity trainings:

  (edited for brevity)
  #1/load
  from langchain_community.document_loaders import PyPDFLoader
  loader = PyPDFLoader("cybersecurity-trainings.pdf")
  pages = loader.load_and_split()
  #2/split
  #done previous step, but depending on doc can be done here
  #3/store
  from langchain_chroma import Chroma
  from langchain_community.embeddings import OllamaEmbeddings
  vectorstore = Chroma.from_documents(documents=pages, embedding=OllamaEmbeddings(model="llama3"))
  #4/retrieve
  # k=6 means we want 6 results per query
  # search by similarity; other search types exist
  retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
  #5/generate
  from langchain_core.prompts import PromptTemplate
  template = """Use the context to give a response. If you don't know the answer, say that you don't know, do not invent an answer. Use 3 sentences at max, and stay as concise as possible.
  {context}
  Question: {question}
  Helpful Answer:"""
  custom_rag_prompt = PromptTemplate.from_template(template)
  from langchain_core.output_parsers import StrOutputParser
  from langchain_core.runnables import RunnablePassthrough
  def format_docs(docs):
      return "\n\n".join(doc.page_content for doc in docs)
  from langchain_community.llms import Ollama
  llm = Ollama(model="llama3")
  rag_chain = (
       {"context": retriever | format_docs, "question": RunnablePassthrough() }
       | custom_rag_prompt
       | llm
       | StrOutputParser()
    )
  for chunk in rag_chain.stream("What's this document about?"):
      print(chunk, end="", flush=True)

And you can query anything from this document. I asked for the list of trainings, the prices, the content, and everything was correct. I asked about a trout-fishing training, and it rightly said there was none.

I'm waiting for a C (or C++) code splitter; combined with metadata, it could become a big plus for analyzing code. For the time being, RAG is good for documents (text, PDF, HTML...) but not for code.

3/ And for vulnerability research?

At this point, we know how to use a local LLM and how to configure it for our needs. It's now time to confront those needs (finding vulns and exploiting them) with reality (large, unanalyzed code).

protip: don't say "I want to hack things", which the LLM will refuse; say "I want to fix the vulnerabilities in this code" :D

another protip: customize your model and add a preprompt such as "you're a security engineer used to doing code review and finding bugs"

3/1/ Defining the problem

The obvious truth appears quickly: simply asking "find a vulnerability in this code" doesn't work. It finds things, it misses obvious vulns, it hallucinates; it's not useful. As long as you try it on trivially vulnerable code, you get the illusion that the LLM is good, but you don't need an LLM for obvious vulns. You want the LLM to help you find or pinpoint vulnerabilities, and to avoid false positives.

Sure, you can be precise in what you ask for, such as "can you find obvious vulns in this (large) code", then "can you find the code paths where user input is parsed", "are all options in this switch case validated", "can you find me a persistent allocation with controlled data and chosen size", "is there an unchecked return value of a function that lets us free something in some code path", and so on. There, it could help a lot.

But it doesn't work. LLMs process language, not code. For example, LLMs know that "format string" often appears near printf-family functions, so if you ask for it they will falsely report a format string vuln wherever there is a printf call.

With a bit of luck you'll find a real vuln, but you mostly end up triaging false positives. Do a grep for printf, drop a large percentage of the output, and you'll get the same quality as an LLM. And that's pretty much the case for any vuln class. And that's sad.

As long as you evaluate the model on small code, it works; as soon as the code gets bigger, it gets lost. And if you have used LLMs, you know that they are prone to give an answer even if there is none... Ask for a "format string" in any code, and it will find at least one. Ask for an overflow, it will find one.

3/2/ Can LLMs help?

Instead of finding vulns directly, we can try to use the LLM as a helpful assistant for browsing large portions of code.

When analyzing a large codebase, you sometimes need a super-grep or super-assistant, able to answer questions such as "where is user input parsed, and on which paths?", or "is there a structure of size 0xXXX?", and so on.

That's a failure too. Finding interesting functions just doesn't work (grep is better), and listing paths is broken (hallucination happens really often).

As for help with structure definitions or sizes, remember that LLMs are good at producing sentences but bad at counting (I've heard there have been some improvements on this). Any structure will yield any size, any buffer size will be miscalculated, and any comparison made by the LLM is no better than random. Once again, you lose a lot of time. LLMs are pretty good at writing code, good at explaining short functions, pretty bad at understanding a whole codebase, and particularly bad at understanding function calls and arguments.

3/3/ A real-life example

I tried several models with this code: https://github.com/hardik05/Damn_Vulnerable_C_Program

Although it's not a really good example:

  • code is small
  • there are a lot of hints in comments
  • there are a lot of vulns (!), so the LLM will surely pick up one or two

it's a start: it's easy to reproduce on different LLMs and easy to work with.

The results are not good. Although it found some vulns, it really hurts the eyes to read things like this:

    >>> Is there a format string vulnerability in this code?

    Yes, there is a potential format string vulnerability in the
    `ProcessImage` function. The function uses `printf()` to print
    out the image data, but it does not check the input provided
    by the user before passing it to `printf()`.

    >>> can you print me the line?

    The line you are referring to is:
    ```c
    printf("\n\t%s\t%d\t%d\t%s\r\n",img.header,img.width,img.height,img.data);
    ```
    This line prints out the `header`, `width`, `height`, and `data`
    fields of the `struct Image` using `printf()`. The `%s` format 
    specifier is used to print a string, while the `%d` format
    specifier is used to print an integer. The `\t` character is used
    to add a tab between the different fields, and the `\r\n` sequence
    is used to add a newline character at the end of each line.

    It's important to note that this line does not perform any input
    validation or sanitization, which could potentially lead to security
    issues if the input data contains malicious data, such as a format
    string vulnerability.

That's an interesting (!!!) way of thinking.

and it doesn't know how to count either:

    //For ref:
    //struct Image
    //{
    //  char header[4];
    //  int width;
    //  int height;
    //  char data[10];
    //};

    >>> Give me the size of the struct Image?

    The size of the `struct Image` is 32 bytes on a 64-bit system.

Wrong again: with the usual 4-byte alignment of int on x86-64, the struct is 24 bytes (4 + 4 + 4 + 10 = 22, padded to 24), not 32.

Conclusion

Do LLMs help the security researcher? At the moment, no: neither in finding vulns, nor in helping to browse a codebase.

But I think we have to keep an eye on them; their progression is really fast. When I started writing this, no LLM was able to calculate the size of the Image struct. Today, ChatGPT o1 gave me the right answer, with a note saying that alignment has to be taken into account.

And today, LLMs are really, really good at synthesizing and summarizing a long text. They are still bad at synthesizing and summarizing code, but I think we are not too far away.

As models get bigger, context windows get bigger too. Soon we will be able to load code such as the Linux kernel and really get help. It may only be a matter of time before LLMs become a great help in searching for vulnerabilities and give insightful answers.
