LLM: SWI-Prolog and large language models

I think it would be a good idea to integrate SWI-Prolog with large language models running locally on your own machine (without requiring access to a cloud service). There are several reasons for joining Prolog and an LLM together:

  • LLMs are not good at hierarchical planning of tasks; Prolog is excellent for this
  • LLMs have a limited context for input data (a.k.a. the prompt), so they often need external “memory” to feed data into the prompt. Prolog is a great match for this, given its database of rules and facts. With Prolog we can easily have a “smart” memory for the LLM, as opposed to passive data.
    • even if the model supported a very large context size, inference becomes slower, so having access to “memory” through Prolog is a great advantage even with large context sizes
  • LLMs are not grounded in factual information; Prolog can be used to double-check LLM output by using rules (see the sketch after this list).
  • LLMs can produce data in Prolog term format, either through few-shot examples (although this is flaky) or using a grammar that restricts output tokens to Prolog facts (the latter is already implemented in llama.cpp)
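To make the last two points concrete, here is a minimal sketch (the predicate names are made up for illustration, not an existing library) of how Prolog rules could double-check facts produced by an LLM before they enter the database:

% Hand-curated facts we trust.
known_planet(mercury).  known_planet(venus).   known_planet(earth).
known_planet(mars).     known_planet(jupiter). known_planet(saturn).
known_planet(uranus).   known_planet(neptune).

% accept_llm_fact(+Term) asserts an LLM-produced term only if it
% passes our validation rules; otherwise it is reported and rejected.
accept_llm_fact(planet(P)) :-
    (   atom(P), known_planet(P)
    ->  assertz(llm_planet(P))
    ;   format(user_error, "Rejected LLM fact: ~q~n", [planet(P)]),
        fail
    ).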

Integrate llama.cpp

I think the best option out there is llama.cpp: it is MIT licensed and it allows you to run open source LLMs on your local machine. The build produces one executable (called main) plus many other “example” executables which expose different features/functionality. To test it out, do the following:

  • clone the llama.cpp repo, and build it running make -j 8 LLAMA_FAST=1
  • download a model, like codellama-7b-instruct.Q4_K_S.gguf
    • NOTE: gguf is the new quantized binary file format for models, ggml was the previous format (no longer supported).
  • run the model in interactive chat mode with ./main -m /tmp/codellama-7b-instruct.Q4_K_S.gguf -i --color --in-prefix '[INST] ' --in-suffix '\n\n[/INST]' --multiline-input --interactive-first -e -c 16384 -s 1333
  • after you type your prompt, end it with \ and press <enter> to tell the model to start generating. In the example below I simply ask it to produce a quicksort implementation in C. You can see the output of the model below, after [/INST].
    sample output:
[...]
llama_model_loader: loaded meta data with 17 key-value pairs and 291 tensors from /tmpd/huggingface/hub/models--llama/codellama-7b-instruct.Q4_K_S.gguf (version GGUF V1 (support until nov 2023))
[...]
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to LLaMa, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

[INST] You are an expert C programmer. Please write a quicksort function.     


[/INST]  Sure! Here's a simple implementation of the QuickSort algorithm in C:
```c
void quicksort(int *arr, int low, int high) {
    if (low < high) {
        // Partition the array around its midpoint
        int mid = (low + high) / 2;
        int partition_index = partition(arr, low, high, mid);

        // Sort the left and right parts recursively
        quicksort(arr, low, partition_index - 1);
        quicksort(arr, partition_index + 1, high);
    }
}

int partition(int *arr, int low, int high, int mid) {
    // Move the pivot element to the end of the array
    int pivot = arr[mid];
    swap(&arr[mid], &arr[high]);

    // Partition the array around the pivot
    int i = low - 1;
    for (int j = low; j < high; j++) {
        if (arr[j] <= pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }

    // Move the pivot element to its final position
    swap(&arr[i + 1], &arr[high]);
    return i + 1;
}
```
Note that this implementation uses a "middle-first" partitioning scheme,
 which is efficient for sorted arrays. The `partition` function moves the
 pivot element to the end of the array and then partitions the remaining 
elements around it using a simple swap algorithm. The `quicksort` 
function then recursively sorts the left and right parts of the array.
[INST] 

Advantages of llama.cpp for integration into SWI-Prolog

  • It runs on CPU and GPU
  • it is fast
  • it has an MIT license
  • it does not require any dependencies
  • although it is written in C++, the API is a simple C API.
  • most of the GUIs being developed out there (for local LLMs) use llama.cpp
  • it is heavily maintained, with features added all the time
  • it has support for sampling tokens by defining a BNF grammar, which could be used to generate Prolog facts from the model.
  • it lacks good documentation, but the code and API are easy to read for a C developer, and there are plenty of real examples and features in the examples directory.

All in all I think this would be a great addition to SWI-Prolog.

4 Likes

I agree. We might, however, consider using it through the new Python interface? The only problem seems to be that while we have no llama interface, Python has 17 pages of them, making selecting the right one about the same amount of work as writing the N+1-th :slight_smile:

Anyone with experience with the Python llama interfaces, preferably something that is well maintained and a simple wrapper around llama.cpp? I also found openllm, which suggests a generic interface to LLMs?

Any disadvantage to using Python in between? One would be interface overhead. Another might be using the LLM from multiple threads. Can llama.cpp run multiple things concurrently?

There are essentially two major LLM Python interfaces in the open source world:

  • the Hugging Face transformers library, which is huge and supports almost everything. People with access to cloud GPUs use this and/or PyTorch. The transformers library, however, is really meant for GPU hardware and not for local usage on consumer machines. It does not support CPU inference with optimized quantized models for consumer machines.
  • llama-cpp-python: this is what most open source LLM GUIs use under the hood on consumer-level machines; it is a Python interface to llama.cpp.
  • there is a third interface in Python called ctransformers, which supports llama.cpp under the hood plus some other libraries. It is less commonly used than llama-cpp-python (or is used alongside it).

Almost all AI model researchers (not app developers, but actual neural network researchers) use PyTorch, and if they open source their work, they generally provide interfaces for using their models through Hugging Face transformers. This setup, however, is not usable by the general public or by people who don’t have high-VRAM GPUs (these models use 32- or 16-bit floats, which need a lot of VRAM). This is where llama.cpp comes in, and why most open source Python LLM GUIs use llama.cpp under the hood. Open source AI contributors (who store models on Hugging Face) take the 16-bit or 32-bit floating point models used by Hugging Face transformers and convert them to quantized models using the 2-8 bit representations supported by llama.cpp; in this way people without high-end GPUs (or with just CPUs) can use the models. The file format for these quantized models is called ggml (older) or gguf (newer).

NOTE: even though the project is called llama.cpp (because the original purpose was to support the llama LLM), it now supports other model architectures such as Falcon, GPT-J, etc. They are also working on support for other model types such as BERT (used for passage retrieval tasks), images and audio.

I don’t think it is a good idea to go through Python (for an SWI-Prolog LLM interface), for the following reasons:

  1. almost all the open source Python interfaces that support CPU and GPU use llama.cpp under the hood, via llama-cpp-python. Going through Python just so that Python can interface to llama.cpp causes unnecessary bloat.
  2. Python libraries have to catch up with llama.cpp, which adds features very quickly
  3. support for some consumer-grade GPU functionality like Apple Metal is only available with llama.cpp (there may be some less well known projects out there doing this, but none of the major Python GUIs use them; they all use llama.cpp)
  4. llama.cpp is extremely lean, whereas a Python interface would add a tremendous amount of bloat without any advantage: the main executable of llama.cpp is a stand-alone executable of 1.1 MB without any dependencies.

Python however makes sense, but for a different use case:

  1. For AI neural network researchers, who would mostly want to use PyTorch
  2. For people who want to use much higher-level APIs like LangChain or Hugging Face transformers (and are willing to accept the bloat); they can already use these from SWI-Prolog with the new Python integration

As for multithreading, llama.cpp uses multiple threads under the hood for some tasks, so I would guess it supports multithreading: the main API always takes a context argument, which suggests there is no global state, and to support GPUs they have to parallelize operations. So my guess is that there is multithreading support at the API level, but it needs to be verified. In addition, I think the sample HTTP server provided is multithreaded.

2 Likes

Thanks. This is still a bit of new territory for me: if you had asked me 2 months ago I would have blindly agreed. Now I’m less sure.

So, we have to deal with a developing C++ library. In my experience interfacing with C++ is typically a maintenance nightmare, though @peter.ludemann’s new C++ interface should make the process more manageable. Just think about the space package … Next, we are a small Prolog community that seems to have few people with the C/C++ expertise needed to design a good llama.cpp interface and keep it updated as llama.cpp evolves.

Now consider the Python route. As is, SWI-Prolog with embedded Python seems a breeze on most platforms. Only the macOS binary and Linux snap package are problematic. Hopefully that can be resolved. While the latency of the Janus interface is quite good (+/- 1 us), it doesn’t even matter for this use case, as it seems all calls of the LLM APIs are many orders of magnitude slower.

Just to show this, to get a simple demo running I did

pip install llama-cpp-python

Wrote this Prolog code, translated from the example on the Python page.

:- use_module(library(janus)).

test(Prompt) :-
    format(string(Query), 'Q: ~w A: ', [Prompt]),
    % Load the model (done on every call here; a real interface would cache the handle)
    py_call(llama_cpp:'Llama'('codellama-7b-instruct.Q4_K_S.gguf'), LLM),
    % Ask the model object to complete the prompt, limited to 32 tokens
    py_call(LLM:'__call__'(Query, max_tokens=32, stop=["Q:", "\n"], echo= @true), Out),
    print_term(Out, []).

And ran

?- test("Name the planets in the solar system?").
<lots of verbose stuff>
py{ choices:[ py{ finish_reason:stop,
		  index:0,
		  logprobs:@none,
		  text:'Q: Name the planets in the solar system? A: 8. The planets are Earth, Mars, Venus, Mercury, Jupiter, Saturn, Uranus, and Neptune.'
		}
	    ],
    created:1693818897,
    id:'cmpl-7efd607b-4769-4da9-8637-1b843c0d00a6',
    model:'codellama-7b-instruct.Q4_K_S.gguf',
    object:text_completion,
    usage:py{completion_tokens:32,prompt_tokens:15,total_tokens:47}
  }

This took merely a few minutes to get working. Most of the time went into figuring out which model to download. I assumed that codellama was about coding, but it seems to be a general model, and I could not find the model used on the Python example page. Sorting that out took half an hour :frowning:

I have the impression we can stop writing Prolog interfaces for many packages and just use the Python interface. Only if latency really matters and/or we want to use the concurrency of the underlying C(++) code is it probably worthwhile to bind directly to C(++). Notably, embedded database engines such as RocksDB, SQLite, HDT, etc. come to mind.

1 Like

The new C++ interface makes some things easier - mainly exception/fail-handling, error checking, resource cleanup (RAII), and string handling; but there are still traps for the unwary because the C++ and Prolog memory models are different. So, I wouldn’t recommend C++ interfaces unless performance is an issue.
And, of course, there are differences between Linux and Windows - using Python avoids most of those.

I’m slowly working on bringing them up-to-date. There seem to be some common themes (e.g., using “aliases” to identify connections, options handling, property checking) that can be factored out. But that still leaves a fair bit of complexity, especially as databases fit more naturally with nondeterministic predicates – nondeterministic predicates are easier with the new C++ interface, but not trivial.

Also, C++ is considered a language for “experts” whereas Python is for “everyone”, so typically more effort is put into Python APIs, to reduce maintenance and also to simplify things.

Even so, I don’t recommend not touching the Python interfaces for 6 years, as happened to hdt. :slight_smile:

The API is in C, not C++; look at https://github.com/ggerganov/llama.cpp/blob/bd33e5ab92e7f214205792fc1cd9ca28e810f897/llama.h#L50C1-L58C7:

#ifdef __cplusplus
extern "C" {
#endif

[...]
   // Run the llama inference to obtain the logits and probabilities for the next token.
    // tokens + n_tokens is the provided batch of new tokens to process
    // n_past is the number of tokens to use from previous eval calls
    // Returns 0 on success
    LLAMA_API int llama_eval(
            struct llama_context * ctx,
               const llama_token * tokens,
                             int   n_tokens,
                             int   n_past,
                             int   n_threads);
[...]

The whole API is wrapped in extern "C"; only the implementation is in C++. No need to deal with C++, only with the C API.

Real Prolog benefits

This is a good example, but it may also prove my point. With the Python interface, how could you make the model produce Prolog facts of the form planet(Atom)?

With llama.cpp you can write a small BNF grammar that describes a Prolog fact and guarantees that the model produces output following the grammar (in this case planet(…) facts):

$ main -m /tmp/codellama-7b-instruct.Q4_K_S.gguf  --grammar-file ./plfacts.gbnf  -e \
   -p "list common names in prolog fact format: \nname(\"John\").\nname(\"James\").\n Q: list the planets in the solar system in prolog fact format: \n" 
[...]
 list common names in prolog fact format: 
name("John").
name("James").
 Q: list the planets in the solar system in prolog fact format: 
planet(mercury).
planet(venus).
planet(earth).
planet(mars).
planet(jupiter).
planet(saturn).
planet(uranus).
planet(neptune).
planet(pluto).
planet(moon).
[...]

Here is the plfacts.gbnf grammar file:

root   ::= planet+
planet ::= "planet(" planetname ").\n"

# only lowercase planet names to form an atom
planetname ::= [a-z]+

To show the problem, here is the output without the BNF grammar:

$ main -m /tmp/codellama-7b-instruct.Q4_K_S.gguf    -e \
  -p "list common names in prolog fact format: \nname(\"John\").\nname(\"James\").\n Q: list the planets in the solar system in prolog fact format: \n"
[...]         
 list common names in prolog fact format: 
name("John").
name("James").
 Q: list the planets in the solar system in prolog fact format: 
solar_system(earth, mars).
solar_system(jupiter, saturn).
[...]

Without the grammar the model sometimes produces the right format, but not all the time.
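As a side note, because the grammar guarantees syntactically valid Prolog terms, the output can be read straight back into Prolog. A minimal sketch, assuming the generated facts have been saved on their own to a file (e.g. planets.pl) and reusing the hypothetical accept_llm_fact/1 validation step from the first post:

% Read every term from a file of model output and hand each one to the
% validation step; terms that fail validation are simply skipped.
load_llm_facts(File) :-
    setup_call_cleanup(
        open(File, read, In),
        read_llm_terms(In),
        close(In)).

read_llm_terms(In) :-
    read_term(In, Term, []),
    (   Term == end_of_file
    ->  true
    ;   ( accept_llm_fact(Term) -> true ; true ),
        read_llm_terms(In)
    ).

With the knowledge base from the first post, planet(pluto) and planet(moon) in the output above would be rejected, while the eight real planets are asserted.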

Now it gets interesting. I think the point you want to make is that llama.cpp supports grammars while llama-cpp-python does not (yet), no?

Indeed, the docs of llama-cpp-python do not mention grammars (at least, I cannot find it). A grep through the code suggests all this is implemented though. The package also suggests it strives to cover the complete API using Python ctypes. Now the second question is how to get all this working :frowning:

Given a C implementation we could of course also consider SWI-Prolog’s ffi package. That avoids the tedious work on dealing with structs and enums by hand.

Roughly, I guess we have these options for accessing some library

  • Write a clean nice Prolog friendly wrapper and document it properly. Quite a bit of work, expertise is rare in our community. Fast and nice for the Prolog user.
  • Use the Python interface. If it is complete (enough) and well documented, this is really easy for the Prolog programmer. Typically it does require a small Prolog abstraction to get a clean Prolog interface to the stuff you need (as opposed to wrapping everything cleanly in the first option); see the sketch after this list.
  • Use the SWI-Prolog ffi package. This is a little less maintenance than the Python ctypes interface, as it extracts all information from the header (structs, constants, enums, etc.). It is a little tedious and error prone to use directly, so you’ll end up writing a cleaner user-facing interface on top of it that needs to be comprehensive and documented.
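To illustrate what such a small Prolog abstraction over llama-cpp-python could look like, here is a minimal sketch via Janus (the predicate names are made up; the model file is the one from the demo above). It caches the model handle so the model is loaded only once; whether storing a Python object reference in the dynamic database like this is safe with multiple threads is exactly the kind of thing that would need to be verified:

:- use_module(library(janus)).

:- dynamic llm_handle/1.

% llm(-LLM): load the model once and cache the Python object reference.
llm(LLM) :-
    llm_handle(LLM),
    !.
llm(LLM) :-
    py_call(llama_cpp:'Llama'('codellama-7b-instruct.Q4_K_S.gguf'), LLM),
    asserta(llm_handle(LLM)).

% llm_complete(+Prompt, -Text): one completion, returning only the generated text.
llm_complete(Prompt, Text) :-
    llm(LLM),
    py_call(LLM:'__call__'(Prompt, max_tokens=64, echo= @false), Out),
    get_dict(choices, Out, [Choice|_]),
    get_dict(text, Choice, Text).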

Performance-wise, the first option wins, but all options have microsecond-range latency. Concurrency can be a problem with the Python route and not with the two others. The good news is that StackOverflow typically has answers to how to use the Python interface, even if it is poorly documented. Who is the winner?

Yes, my main point is that llama-cpp-python will always lag behind llama.cpp and some things it may never be able to implement: for example, llama.cpp runs in wasm.

Why should prolog lag behind llama-cpp-python when it can access llama.cpp directly?

It is not just grammars. It is also speculative sampling, beam search, etc. It is too important a technology to have SWI-Prolog lag behind. The only reason I was able to see the importance of grammars for Prolog is that I used llama.cpp directly and, in passing, saw that it had grammars and figured they could be used to produce Prolog facts. This kind of thinking would never have happened to me if I had used llama-cpp-python.

My point is that this is a foundational technology, and if SWI-Prolog sits lagging behind the Python interface, the community will miss many important opportunities and developments, just as I would have missed the use of grammars to generate Prolog facts or rules if I had used the Python interface.

1 Like

Even so, using the SWI-Prolog C++ interface (version 2) should produce code that’s more compact, less likely to have bugs, and easier to maintain.

1 Like

There is a fourth option, which I think might be optimal at this point:

  • to write a Prolog API which under the hood calls the main executable and controls it through command line arguments, passing and receiving data through stdin/stdout/stderr pipes (a rough sketch follows below). Most of the features of llama.cpp are exposed in the main executable. This solution has none of the disadvantages of the Python interface and is much less work than a Prolog wrapper or using the C ffi package. It also gives the Prolog API an opportunity to mature, and later on it could be replaced with Prolog wrappers. It is also a way to make something available to the community with which expertise can be gained in the new field. (By the way, I used to think that using stdin/stdout for interfaces was hacky until I did some work with Erlang; they recommend doing most extensions this way.)
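A rough sketch of what that could look like with library(process), assuming main is on the PATH and using a simplified subset of the flags from the interactive-mode example earlier in the thread. The predicate names are made up, and re-synchronising on the echoed [INST] prefix is the fragile part that would need to be verified:

:- use_module(library(process)).
:- use_module(library(readutil)).

% Start main in interactive mode; return the process id and the pipes.
llama_start(ModelFile, llama(PID, In, Out)) :-
    process_create(path(main),
                   [ '-m', ModelFile,
                     '-i', '--interactive-first', '--multiline-input',
                     '--in-prefix', '[INST] ', '--in-suffix', '[/INST]'
                   ],
                   [ stdin(pipe(In)), stdout(pipe(Out)), process(PID) ]).

% Send a prompt (terminated by '\' + newline, which submits the input in
% multiline mode) and collect output until the next prompt marker appears.
llama_ask(llama(_PID, In, Out), Prompt, Reply) :-
    format(In, "~w \\~n", [Prompt]),
    flush_output(In),
    read_until_marker(Out, "[INST]", Lines),
    atomic_list_concat(Lines, '\n', Reply).

read_until_marker(Out, Marker, Lines) :-
    read_line_to_string(Out, Line),
    (   (   Line == end_of_file
        ;   sub_string(Line, _, _, _, Marker)
        )
    ->  Lines = []
    ;   Lines = [Line|Rest],
        read_until_marker(Out, Marker, Rest)
    ).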
3 Likes

Yes. I forgot about that. I once did that for the Stanford CoreNLP toolkit. Of course, the latency is poor, but that is barely relevant for most NLP tasks. We can drive multiple executables as well to support concurrency or multiple models (given enough memory :slight_smile: ).

The main problem with the CoreNLP set (when I tried this) is that its output was hard to parse, because the literals from the original text were not escaped. As a result you needed a non-deterministic parser (luckily not a big deal in Prolog). For this to work nicely the tool should have a well defined, machine oriented and stable output syntax. JSON output is ideal :slight_smile:
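For what it is worth, if the executable could emit JSON, the Prolog side becomes trivial with library(http/json); a sketch, assuming Out is the stdout pipe set up as in the process sketch above:

:- use_module(library(http/json)).

% Read one JSON reply from the tool's stdout pipe into a Prolog dict,
% e.g. _{text:"...", tokens:42}.
read_json_reply(Out, Dict) :-
    json_read_dict(Out, Dict).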

Another possible issue here is state. It seems we can send multiple prompts to the same executable? Then we have interrupts. Good thing is that we can abort Prolog while it waits for the program. Ideally we’d be able to re-synchronize though. That may require signal handling and surely requires a pretty unique prompt for which we can wait.

Possibly there is room for a high level library to perform this type of interaction.

A little worrying is that this particular toolkit has a C API that seems to be the thing used by the other language bindings, so we can assume that the authors consider this the main route?

I agree that a good out-of-the-box binding to llama.cpp is certainly worthwhile. We do have to choose between these 4 alternatives (or using llama.cpp as a service).

Forgot that. Not sure how realistic it is to really use llama-cpp in WASM given the current WASM limitations, but it points at using the C interface, no? We cannot deal with processes in the browser. Well, we could create two WASM tasks in the browser and use message passing to connect them, but that is yet a different interface.

Why not instead just communicate via an API? This could be an external API (e.g. OpenAI, Replicate’s access to Llama 2, …) or a locally run service (e.g. Ollama). There could be a server shim layer (written in Python, since this has the best support).

I’d recommend looking at some of the design patterns in Simon Willison’s LLM tool:

https://llm.datasette.io/en/stable/

In particular the plugin system, which allows access to a wide variety of different LLMs:

https://llm.datasette.io/en/stable/plugins/index.html

Indeed, latency is poor, but compared to how long it takes to do inference on the model it is negligible. If possible we should not restart the executable to handle new prompts. The one use case where we may need to restart the executable (although there is probably a way around it) is when we want the context to be cleared before processing the prompt.

This is what many people are struggling with today with respect to LLMs. For example, in teaching the model to run external tools you want output in a well-defined format. Most people have not discovered the BNF grammar technique available in llama.cpp and how it helps with this kind of problem. There is even a JSON grammar provided as an example with llama.cpp.

You can run the executable in “interactive” mode, and it works like a terminal chat version of ChatGPT. There are several flags for interactive mode. So from the Prolog point of view it could just work with a send-and-wait-for-reply loop. The special consideration is that different models have different prompt templates. One model may have the prompt template "### Instruction: {put your question here}\n\n### Response: " (this is called the Alpaca template) and another model may have a different template, like “### User: {put your question here}\n\n### Response:” or “[INST] {put your question here}\n[/INST]”. The prompt templates are usually found in the model card on Hugging Face (huggingface.co is like GitHub, but for models). I wrote a little excursion on this below.
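Applying such a template is trivial on the Prolog side. A minimal sketch (the template names and the prompt_template/2 table are made up for illustration):

% prompt_template(?Name, ?Format): format/3 templates; ~w is the user prompt.
prompt_template(alpaca,    "### Instruction:\n~w\n\n### Response:\n").
prompt_template(codellama, "[INST] ~w\n[/INST]").
prompt_template(none,      "~w").

% wrap_prompt(+Template, +Prompt, -Wrapped)
wrap_prompt(Template, Prompt, Wrapped) :-
    prompt_template(Template, Format),
    format(string(Wrapped), Format, [Prompt]).

For example, wrap_prompt(codellama, "Name the planets in the solar system.", W) gives W = "[INST] Name the planets in the solar system.\n[/INST]".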

For signal handling the main executable has special processing for SIGINT which it uses as follows in interactive mode:

  1. if the program is in interactive mode and the model is producing output and the user presses ^C, then the output stops and it goes back to waiting for input
  2. if the program is waiting for input and the user presses ^C, then the timings are printed on stderr and the program exits.

For the most part I think the Prolog API can run the program in interactive mode with a send-reply loop, but the API should also allow restarting the program (or some other mechanism) to clear the context.

For sure the C API is the best and proper way to interface to it, but I think there is value in first doing the API through the executable (by the way, the main executable is simply a heavily used ‘example’ of how to use the API). Since this is a foundational technology there are many concepts to learn, because models work in a very different way from normal software systems: they are not deterministic and there are conceptual differences to grasp. These differences are better understood at first (I think) by using the executable rather than by interfacing with the API. So I think it is worth reading the docs for main and then interfacing with that as a first iteration. But this is just my thought; it may be easier for you to work with the C API directly, using the source code of main as a guide. From what I have seen you do before, you may prefer this route.

Yes, I agree. I just pointed out the WASM case to show why it is valuable to interface to llama.cpp rather than to llama-cpp-python. It allows future possibilities that would not be available with the Python interface.

Excursion: why prompt templates exist

If you look at the model card for Llama-2-13B-GGUF you will see this:

Prompt template: None

{prompt}

If you look at the model card for Nous-Hermes-Llama-2-7B-GGUF you will see this:

Prompt template: Alpaca

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:

If you look at the model card for CodeLlama-7B-Instruct-GGUF you will see this:

Prompt template: CodeLlama

[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]

Why the different prompt templates?

Base models. When researchers develop a model, they first train it on trillions of tokens of internet text. It is simply text and source code. As they used the models they found that the models could complete text properly, but were not good at answering questions or following instructions. They “understood” the language, but were not good at chat-type conversations.

Instruction fine-tuned models. What they did next was to take the original model, trained on trillions of tokens, and fine-tune it with a much smaller set of training data in an Instruction/Response format. Instead of just raw text, humans generated a number of Instruction/Response samples, similar to the one shown below.

Prompt templates. But they had to find a way to teach the model to distinguish between plain raw text and the instruction/response examples it was given. This is where prompt templates came in. To help the model distinguish, they added a prefix before the actual question or instruction, such as ### Instruction:\n. They also added a suffix to prompt the completion of the response, such as ### Response:. So the original model was fine-tuned with structured instruction/response pairs like this:

### Instruction:
 Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and 
vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the 
essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular 
physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. 
Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each 
week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. 
It helps to regulate mood, improve cognitive function, and supports healthy growth and immune 
function. Aim for 7-9 hours of sleep each night.	

Different researchers used different prefixes, and so we now have a variety of prompt templates.

The model without a prompt template is usually called a base model or foundation model, and the model fine-tuned with instruction/response pairs is often just called a model or an instruction fine-tuned model. This is why a base model, like Llama-2-13B-GGUF above, says it doesn’t have any prompt template. There are also fine-tuned models that are fine-tuned on a specific corpus of data, such as scientific papers, etc., in addition to some instruction/response text. This is the case, for example, with the Nous model above.

Note. The fine-tuned models are generally “smart” enough to answer questions and follow instructions even if the original prompt template is not used. For example, the user can use Q: and A: instead, and the model often responds correctly. However, the probability of a good response increases if the prompt template is followed, and sometimes you don’t get a good response at all if you don’t use the model’s prompt template.

For the use case you mentioned we can just use Python within SWI-Prolog to talk to the many APIs available out there.

Here we are talking about a different use case: not simply generating text from models, but having access to special features (such as grammars, different sampling algorithms, etc.) using local models on your own machine.

I had a very brief look at using SWI-Prolog’s ffi package. It seems that could be a quite simple route that allows the community to keep up with the C API of llama.cpp. There is an annoying issue that the API uses structure passing and libffi didn’t support that in the days I wrote the Prolog binding for it. It does now. The interface is a little involved though and replicates a lot of the work I did to deal automatically with struct layout in Prolog :slight_smile: So, ideally we’d add struct support to the Prolog ffi package. The alternative is to write a little C code that wraps the structure passing APIs into APIs that deal with struct pointers. That is less work, but also less elegant :frowning:

Time and money is a bit of a bottleneck. Someone around who can provide commercial backup to get this bootstrapped?

Good thinking there, as always. I hadn’t seen that using the ffi package would be a great advantage in letting the community easily keep up with new developments. It is certainly much easier to update than wrappers.

I wish I could help you get more paid projects. One of the back-of-my-mind reasons I brought up llama.cpp is that I saw that using models is becoming more and more ubiquitous, and I thought it could affect you in the future, since people might think that Prolog is of no use anymore. This is not true, because of the things I mentioned in the very top post, and so I thought that having this technology ready would put you in a more advantageous position as soon as a potential paying customer asks you about using a model.

In reality people have not yet really discovered the value of models. The one that has come closest is Tesla. I think very soon people will discover that models can really adapt and learn many things besides text, including visual and real-world data. People will be using models for almost everything; however, I think there will always be a need to ‘box’ the model into a particular set of solutions, and this can be done well with Prolog. I thought this would help you in the future, and that was one of the main reasons I made the original post. I wish I could provide more concrete help.

2 Likes