Prolog and LLMs/GenAI?

There have been some isolated discussions on interfacing Prolog to LLMs and other statistical AI technology. Interfacing to this technology is currently quite feasible using either the Janus Python bridge or the HTTP client libraries. While “quite feasible”, there is nothing really ready to use, and many users, especially Prolog novices, will have a hard time getting it all working :frowning: Below are a few topics. Please extend and share your opinion.

  • In what ways can/should Prolog be related to this technology?
  • Can we leverage LLM technology for Prolog programming?
  • What kind of tooling should we provide to make this technology more accessible from Prolog?
  • Is there stuff you can share?

In two years of frequent Prolog and LLM work, I’ve found LLMs inadequate for generating production Prolog code. While LLMs can handle basic examples, beyond that the output typically contains bugs or demonstrates fundamental misunderstandings of Prolog principles, making LLM assistance inefficient. Though LLMs are improving with mainstream languages like JavaScript and Python, they struggle even more with specialized languages like Lean 4.

Fine-tuning a model on SWI-Prolog might seem like a solution, but this approach is impractical. It would require repeated fine-tuning as better models emerge, and code generation without comprehensive domain knowledge has limited utility. Effective Prolog programming demands both language expertise and problem-domain understanding.

A more promising approach would be developing a Prolog benchmark (or a list of benchmarks) to evaluate LLM performance. Such benchmarks typically require a mix of public examples and private test cases to prevent LLMs from simply memorizing the tests. While this necessitates independent test administration, it can motivate model creators to improve their models for Prolog coding.

Notably, LLM performance deteriorates as logical complexity increases. The decline is evident when moving from simple predicates to more advanced concepts like finding all solutions (using setof/3, bagof/3), applying library(aggregate), working with constraints, and ultimately implementing Constraint Handling Rules (CHR).


I’ve been working with LLMs a lot and have thought about some ways to use Prolog and LLMs together.

I haven’t directly interfaced LLMs with Prolog, but many LLMs are good at generating structured outputs like JSON. You can ask them to generate output that conforms to a specific JSON schema, and they are able to. So if there is a way to convert between JSON and Prolog terms/formulas, that could be a way for them to interact.
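
As a concrete sketch of the Prolog side of that idea, SWI-Prolog’s library(http/json) can parse such a JSON reply into dicts, which are then easy to turn into plain terms. The used_for/2 relation and the X/Y key names below are just illustrative assumptions, based on the kind of extraction shown later in this post:

```prolog
:- use_module(library(http/json)).

%! llm_json_to_facts(+JsonAtom, -Facts) is det.
%
%  Parse a JSON reply from an LLM (a list of {"X": ..., "Y": ...}
%  objects) into Prolog facts.  used_for/2 is an illustrative name.
llm_json_to_facts(JsonAtom, Facts) :-
    atom_json_dict(JsonAtom, Dicts, []),       % JSON text -> list of dicts
    maplist(dict_to_fact, Dicts, Facts).

dict_to_fact(Dict, used_for(X, Y)) :-
    get_dict('X', Dict, X),
    get_dict('Y', Dict, Y).
```

By default the values come back as SWI-Prolog strings, e.g. used_for("Prolog", "logic programming").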

Also, many of the agentic frameworks repeatedly call the LLM in a controlled loop, with various checks and state being updated at each iteration. So while LLMs struggle with complex queries in one call, maybe some of the ideas of Prolog could be used to guide them towards better outputs by breaking the reasoning into simpler, more verifiable steps. Or the LLM could be used within a Prolog-like loop to give Prolog more flexibility.
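
To make the second idea concrete, here is a minimal, hedged sketch of such a Prolog-controlled loop; llm_call/2 and valid_answer/1 are placeholders (not existing library predicates) for whatever LLM interface and per-step checks one actually uses:

```prolog
% Drive an LLM from Prolog, re-prompting until the reply passes a check
% or a retry budget is exhausted.  llm_call(+Prompt, -Reply) would wrap
% an HTTP- or Janus-based interface; valid_answer/1 encodes the
% structural/logical checks for one reasoning step.
llm_with_retries(Prompt, Reply) :-
    llm_with_retries(Prompt, 3, Reply).

llm_with_retries(Prompt, Tries, Reply) :-
    Tries > 0,
    llm_call(Prompt, Candidate),
    (   valid_answer(Candidate)
    ->  Reply = Candidate
    ;   Tries1 is Tries - 1,
        format(string(Prompt1),
               "~w~nYour previous answer was rejected; please try again.",
               [Prompt]),
        llm_with_retries(Prompt1, Tries1, Reply)
    ).
```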

Here’s an example program I wrote that tries to use an LLM to extract information from a text file.

I ran it on the “Best Programming Languages to Learn in 2025” section of https://www.simplilearn.com/best-programming-languages-start-learning-today-article with “X is a programming language used for Y” and it generated:

[
  {"X": "Python", "Y": "web development, data science, and enterprise software"},
  {"X": "JavaScript", "Y": "web development, data science, and enterprise software"},
  {"X": "Java", "Y": "enterprise software"},
  {"X": "Rust", "Y": "high-performance and system-level programming, backend and infrastructure development"},
  {"X": "Go", "Y": "high-performance and system-level programming, backend and infrastructure development"},
  {"X": "Kotlin", "Y": "mobile app development"},
  {"X": "Swift", "Y": "mobile app development"},
  {"X": "Python", "Y": "data analysis and model building"},
  {"X": "R", "Y": "data analysis and model building"},
  {"X": "Julia", "Y": "data analysis and model building"},
  {"X": "Solidity", "Y": "decentralized applications and smart contracts"},
  {"X": "Rust", "Y": "decentralized applications and smart contracts"}
]

llm_formula.py (2.5 KB)


I think that code completion, explanation and augmentation are probably the most obvious contributions LLMs can make to Prolog, since that’s the bulk of LLM use for other languages. That is, “write some code to do this!”, “what does this do? / check this!” and “can you add tests for this?”.

As others have said, the current generation of LLMs don’t write very good Prolog, likely due to the sparsity of training data. I had a similar experience with Haskell outside of basic examples. It isn’t obvious that this is a fixable problem: LLM training sets are curated (a model doesn’t just absorb whatever it finds, since all knowledge takes up parameter space), so a company would need to deliberately choose to make its LLM practically good at Prolog – even if the data were available – because it would come at the cost of something else.

Another opportunity for Prolog may come with larger prompt token limits. It may be possible to feed a carefully considered, book length script to the LLM, to set its context such that it writes practically useful Prolog.

In the opposite direction, I think the Janus interface makes accessible the useful ways of using LLMs from Python, which – as a language – will likely be the bellwether for such use cases. Once they are clear enough, the Prolog community could initially wrap these Python use cases in predicates utilising Janus, or, when they are interesting enough, port them to Prolog without Janus.
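
As a small, hedged example of that wrapping step, library(janus) lets a Prolog predicate delegate to a Python function via py_call/2; the Python module llm_client and its complete/1 function below are hypothetical placeholders for whatever Python client code one settles on:

```prolog
:- use_module(library(janus)).

% Wrap a (hypothetical) Python LLM client behind a Prolog predicate.
% Replace llm_client:complete/1 with your actual Python entry point,
% e.g. something built on the openai package.
llm_complete(Prompt, Completion) :-
    py_call(llm_client:complete(Prompt), Completion).
```

Callers only see llm_complete/2, so the same predicate could later be re-implemented in pure Prolog (ex-Janus) without changing anything else.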

John Sowa (one of the authors of Knowledge Systems and Prolog: Developing Expert, Database and Natural Language Systems) has been making some postings on the CG and ontolog-forum lists. I don’t know how to access the archives, but I can forward his messages or add them to this discussion if you wish – or I could send him a message to see if he wishes to contribute to this discussion.

For example, this message dated Wed, 10 Jul 2024, 10:39, sent to ontolog-forum@googlegroups.com and the CG list (cg@lists.iccs-conference.org):

James Davenport found an article that shows how simple-minded ChatGPT happens to be. If it can find an appropriate reasoning method stored in its immense volume of stored data, it can seem to be a genius. But if the problem requires a simple transformation of that reasoning method, it can be very stupid or horribly wrong.

Observation: There are three safe and dependable ways of using LLMs:

  1. Translate languages (including computer notations) to and from equivalent forms in other languages. As we have seen, Wolfram, Kingsley Idehen, and others have successfully used LLMs to provide English-like front ends to their systems.

  2. Use LLMs with a relatively small corpus of closely related data, such as user manuals for some equipment or the complete corpus of a single author to support Q/A sessions about what that author said or wrote.

  3. Use LLMs with a larger amount of data about a fairly large field to generate hypotheses (guesses) about topics in that field, and then use the 70+ years of work in AI and Computer Science to test, evaluate, and correct whatever the LLMs generate.

All three of these methods can be run on a good laptop computer with a disk drive that holds the data (a couple of terabytes would be sufficient). The laptop could be extended to larger systems for supporting the workload of a large corporation. But the monstrous computational systems used by Google, OpenGPT, and others are an irresponsible waste of hardware, electricity, water, and other resources.

The European Union is already putting restrictions on companies that are trying to emulate Google, OpenGPT, and other wasteful systems. And by the way, there are hints coming from Google employees who are becoming disillusioned about the value of processing more and bigger volumes of data.

When a system cannot do simple reasoning and generalization, it can never be truly intelligent. Adding more power to a stupid system generates larger volumes of stupidity.

John


From: “James Davenport” via ontolog-forum <ontolog-forum@googlegroups.com>
Sent: 7/9/24 10:13 PM

There’s a good article today in the Financial Times, showing that, while ChatGPT can solve well-known puzzles (Monty Hall etc.), that’s because it has seen the solution, and it can’t even solve alpha-converted variants. The conclusion is good.

A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would occasionally and sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.



From: John F Sowa

I received the following reply in an offline note:

Anonymous: ChatGPT is BS. It says what is most likely to come next in our use of language without regard to its truth or falsity. That seems to me to be its primary threat to us. It can BS so much better than we can, more precisely and more effectively using statistics with a massive amount of “test data,” than we can ever do with our intuition regarding a relatively meager amount of learning.

That is partly true. LLMs generate a text that is derived by using probabilities derived from a massive amount of miscellaneous texts of any kind: books, articles, notes, messages, etc. They have access to a massive amount of true information – more than any human could learn in a thousand years. But they also have a massive amount of false, misleading, or just irrelevant data.

Even worse, they have no methods for determining what is true, false, or irrelevant. Furthermore, they don’t keep track of where the data comes from. That means they can’t use information about the source(s) as a basis for determining reliability.

As I have said repeatedly, whatever LLMs generate is a hypothesis – I would call it a guess, but the term BS is just as good. Hypotheses (guesses or BS) can be valuable as starting points for new ways of thinking. But they need to be tested and evaluated before they can be trusted.

The idea that LLM-based methods can become more intelligent by using massive amounts of computation is false. They can generate more kinds of BS, but at an enormous cost in hardware and in the electricity to run that massive hardware. But without methods of evaluation, the probability that random mixtures of data are true or useful or worth the cost of generating them becomes less and less likely.

Conclusion: Without testing and evaluation, the massive amounts of computer hardware and the electricity to run it is a massive waste of money and resources.

John




Has he said anything about using structured outputs or using some kind of looping with state in the prompt to break the task into smaller steps?

Those can help improve the reasoning of LLMs and seem like they could integrate with symbolic AI approaches.

This is an example I thought was interesting, because it only uses common terms but still needs help breaking it up.

If I run anthropic.claude-3-sonnet-20240229-v1:0 with the following prompt:

Suppose a person is sitting on the beach watching the sun set over the ocean. If he turns his head to the side towards his left shoulder, what direction is he looking in? Respond in the following JSON form:
```json
{"direction": "NORTH" | "SOUTH" | "EAST" | "WEST"}
```

Then it says:
{
  "direction": "NORTH"
}

But if I ask it:

Suppose a person is sitting on the beach watching the sun set over the ocean. If he turns his head to the side towards his left shoulder, what direction is he looking in? Respond in the following JSON form:
```json
[
{"initial_direction": "NORTH" | "SOUTH" | "EAST" | "WEST"},
{"final_direction": "NORTH" | "SOUTH" | "EAST" | "WEST"}
]
```

Then the response is:
[
  {
    "initial_direction": "WEST"
  },
  {
    "final_direction": "SOUTH"
  }
]

Thanks for the inputs. I have some problems with the criticism that LLMs just follow their inputs, cannot point back to the source, etc., etc. It is in some abstract way all true, but most of that also holds for us humans. Clearly they are not the answer to everything, and they sometimes perform remarkably poorly (and sometimes remarkably well).

I tried using the plunit test docs as context and asked “How can I run tests in parallel?”. ChatGPT gave a completely correct and to-the-point answer. Llama 3.1 and Qwen2.5-Coder came with the suggestion to use set_test_options/1, but said there is no option to handle concurrency. For Llama I tried both a locally running 8B model and the 70B model on Hugging Face, with no significant difference. After that they ramble a lot about threading and how complicated that is … The docs talk about “jobs” and “concurrency”. I thought the link was close enough, but apparently only for ChatGPT :frowning:

I think there are various lines in the discussion

  • How can we use LLMs to help with programming in Prolog?
  • How can Prolog and LLMs be combined for reasoning?
  • Is there a need for a ready-to-use interface between Prolog and LLMs?

We made some progress on the first

And the third

I agree.

I did not list specific examples because I noted some of them over two years ago in my series of posts about using ChatGPT with Prolog. Many of those same problems exist today. The only noticeable improvement on some of them was with the OpenAI o1 model, which I currently don’t have a subscription for.


Rather than starting with a specific technology, the focus should be on solving real-world problems. This requires careful consideration of requirements such as:

  • Need for web interfaces
  • Non-deterministic processing capabilities
  • Computational intensity
  • Real-time processing requirements (e.g., aviation software)

After that, one should search among established technologies that consistently prove valuable, such as:

  • Python: Offers extensive production-ready libraries with domain expert input
  • Prolog: Excels in backtracking, parsing, constraints, and closed-world solutions
  • HTML/JavaScript/CSS: Enables versatile interface creation
  • Cloud services: Provides reliability and adaptability
  • Content Delivery Networks: Supports diverse technological needs

Often, established technologies will be considered before looking for new ones. For example, a few days ago I started working with MiniZinc because there was a need to do constraint solving in a standalone single web page that is not served by a web server.

In other words, starting from real-world problems generates more pertinent questions, and at times the combination of Prolog and LLMs may come up, making the noted questions relevant. Many more such questions about possible synergies between Prolog and LLMs exist.


In keeping tabs on research papers on arXiv, one can find papers related to this. While I check several different sources weekly, one of the more notable ones related to this question is
https://arxiv.org/list/cs.SC/recent

Recent developments in LLM capabilities, particularly in reasoning, are noteworthy. The emergence of concepts like Chain of Continuous Thought (Coconut) (pdf) shows promise. However, the choice between LLMs and Prolog often hinges on whether hallucinations are acceptable. When factual accuracy is crucial, direct LLM usage may be unsuitable, though they can be valuable in:

  • Augmenting results with Prolog-based fact-checking (see the sketch after this list)
  • Generating Prolog code (with mandatory verification)
  • Contributing to design-phase ideation
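
As a hedged illustration of the first point, fact-checking can be as simple as matching extracted claims against a curated knowledge base under a closed-world reading; used_for/2, check_claim/2 and check_claims/2 below are illustrative names, not an existing library:

```prolog
% Tiny knowledge base standing in for curated domain facts.
used_for("Python", "data science").
used_for("Prolog", "constraint solving").

% Label an LLM-extracted claim as confirmed, contradicted or unknown.
check_claim(used_for(X, Y), Status) :-
    (   used_for(X, Y) -> Status = confirmed
    ;   used_for(X, _) -> Status = contradicted   % we know X, but not this use
    ;   Status = unknown                          % the KB says nothing about X
    ).

check_claims([], []).
check_claims([Claim|Claims], [Claim-Status|Report]) :-
    check_claim(Claim, Status),
    check_claims(Claims, Report).
```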

The trend for model makers is to improve the models by giving them reasoning abilities, and they are making progress in this area.

OpenAI o1 model
OpenAI o3 model

What I don’t see people doing is using LLMs during the design phase to ask for ideas. Sometimes an LLM will give a novel idea that is worth investigating. This would include asking how to combine technologies, including Prolog. Often, an LLM will not include Prolog in its answers unless it is suggested or something in the area of logic programming or non-determinism is included.


While there’s apparent demand for ready-to-use interfaces between Prolog and LLMs, several considerations warrant attention:

  1. The rapidly evolving landscape of coding assistants
  2. The cost-benefit ratio of implementation
  3. Potential educational applications, though LLM hallucinations remain a concern
  4. The need for expertise in prompt engineering

The question I would ask is, “Is the reward worth the effort at this time?”


Important Considerations:

  • LLM model performance can fluctuate over time, unlike traditional programming
  • API models, while more stable, typically have limited lifespans
  • Implementation often requires user accounts or API keys, raising questions about cost and accessibility

Personally, I think Prolog and LLMs could be a great match, but we need solid infrastructure to achieve this. Python dominates AI/ML, but that does not mean there is no place for Prolog.

A while back I did an experiment to see if it is possible to implement an AI/LLM app purely in Prolog. It was meant as a simple test of a local LLM. The idea was to have two bots conducting an open-ended dialog with each other. I used ChatGPT to generate the initial version of the code, and surprisingly it was not that bad. But what worked very well was the translation of the code to Polish :slight_smile: I call this experiment a success, but it did point to the fact that we need Prolog implementations of the nuts and bolts needed for interfacing with LLMs.

My take on the OpenAI API is in the project repo here:
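
For readers who just want the general shape of such an interface, here is a rough sketch (not the code from the repo above) of calling the OpenAI chat completions endpoint with SWI-Prolog’s HTTP libraries; it assumes the API key is in the OPENAI_API_KEY environment variable, and the model name is only an example:

```prolog
:- use_module(library(http/http_client)).
:- use_module(library(http/http_json)).

% Send one user message to the OpenAI chat completions API and return
% the assistant's reply text.
openai_chat(Prompt, Answer) :-
    getenv('OPENAI_API_KEY', Key),
    format(atom(Auth), 'Bearer ~w', [Key]),
    Body = _{ model: "gpt-4o-mini",
              messages: [ _{ role: "user", content: Prompt } ] },
    http_post('https://api.openai.com/v1/chat/completions',
              json(Body), Reply,
              [ request_header('Authorization' = Auth),
                json_object(dict)
              ]),
    [Choice|_] = Reply.choices,
    Answer = Choice.message.content.
```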

That’s a red herring argument. It’s not that bad.

Compare today’s Tensor Processing Units (TPUs) with a gaming computer from 10 years ago:

  • Gaming Computer 10 years ago:
    AMD R9 290X (2013): ~52 Watts per TFLOPS (FP32)

  • Tensor Processing Units (TPUs):
    NVIDIA Jetson Xavier NX: ~7.7 Watts per TFLOPS (FP32)

But energy efficiency is better with FP16, and if tensor-accelerated operations are used it is around ~1.7 Watts per TFLOPS.

Generally, energy consumption has gone down dramatically over the past decade. Disclaimer: all figures are from brochures.

Edit 05.01.2025
But yes, still less efficient than our brains. Using some common
estimates (*) our brains consume ~0.02 Watts per exaFLOP.

(*) Robot: Mere Machine to Transcendent Mind
Hans Moravec - 1999
https://www.amazon.de/dp/0195116305

Disclaimer: But who knows how the human brain works?


I think I agree with that. Besides less energy per FLOP, there are also advances in model architecture. Consider “AI leaps forward, mastering the ancient game of Go | CEVA”, which claims that the original AlphaGo, which beat Lee Sedol, used about 1 MW. These days, humans cannot win against current AI Go programs running on ordinary PC hardware. That happened in only eight years.

I think a Chain-of-Thought approach where the outputs are required to have certain structured forms could integrate well with Prolog. The structure can be used to guide the LLM and also facilitates performing different types of checks on the generated outputs.

Also, you could expose Prolog as a tool for the LLM to use in some of the agentic/chain-of-thought approaches. There could be an instruction in the prompt describing what kind of queries to direct to it and how to structure the query.
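
A hedged sketch of what that could look like on the Prolog side: the LLM is prompted to emit a tool call such as _{tool: "prolog", query: "member(X, [a,b,c])"} (the dict shape is an assumption about the prompt, not a standard), and Prolog dispatches it after a safety check:

```prolog
:- use_module(library(sandbox)).

% Dispatch a structured tool call produced by an LLM.  safe_goal/1 from
% library(sandbox) rejects goals with side effects before we run them.
handle_tool_call(Call, Solutions) :-
    get_dict(tool, Call, "prolog"),
    get_dict(query, Call, QueryString),
    term_string(Goal, QueryString),
    safe_goal(Goal),
    findall(Goal, Goal, Solutions).
```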

LLMs are auto-regressive (they do not plan the tokens in the response), and the computational cost of a response is proportional to the number of tokens emitted rather than to any sort of problem complexity. This should make it fairly clear that the model is recalling rather than “thinking” in any mode analogous to humans. So I can see how many of Sowa’s criticisms arise as corollaries of the above.

He shows clear signs of coping problems. We just have an instance of “The Emperor’s New Clothes”: some companies have become naked with the advent of GPT.

I don’t think it is productive to postulate some CANNOT like here:


https://www.youtube.com/watch?v=6K6F_zsQ264

Then in the next slide he embraces tensors for his new Prolog system nevertheless. WTF! Basically this is a very narrow narrative, which is totally unfounded in my opinion.

Just check out these papers:

GRIN: GRadient-INformed MoE
https://arxiv.org/abs/2409.12136

This paints a totally different picture of LLMs; it seems they are more in the tradition of CYC by Douglas Lenat.

Edit 05.01.2025
What’s also interesting: the recent physics Nobel Prize recipient Geoffrey Hinton also has a roughly 30-year-old paper about Mixture of Experts (MoE), which has around 6652 citations:

Adaptive Mixtures of Local Experts
https://www.cs.toronto.edu/~fritz/absps/jjnh91.pdf

You can find his comments by this search of ontolog-forum:
https://groups.google.com/g/ontolog-forum/search?q=Sowa%20llm

I believe that hybrid methods are essential for developing reliable and trustworthy AI systems.

I would suggest contacting him for more (and/or join the Ontolog Forum). He’s long been a friend of Prolog. :slight_smile:

BTW, in the ML world, one method of explaining an otherwise unexplainable model is to train a decision tree on the model – the decision tree acts as a kind of explanation. Similarly, it might make sense to use Prolog to analyze the results of LLMs (and also to prompt them).

Douglas Lenat died on August 31, 2023. I don’t know whether CYC and Cycorp will make a dent in the future. CYC addressed the common knowledge bottleneck, and so do LLMs. I am using CYC mainly as a historical reference.

The “common knowledge bottleneck” in AI is a challenge that plagued early AI systems. It stems from the difficulty of encoding vast amounts of everyday, implicit human knowledge, things we take for granted but computers historically struggled to understand. Currently, LLMs by design focus more on shallow knowledge, whereas systems such as CYC might exhibit deeper knowledge in certain domains, making them possibly more suitable when stakeholders expect more reliable analytic capabilities.

The problem is not explainability, the problem is intelligence.

Edit 05.01.2025
Notice John Sowa calls the LLM the “store” of GPT. This could be a misconception that matches what Permion did for their cognitive memory. But matters are a little bit more complicated, to say the least, especially since OpenAI insists that GPT itself is also an LLM. What might highlight the situation is Fig. 6 of this paper, postulating two Mixture of Experts (MoE) placements, one on the attention mechanism and one on the feed-forward layer:

A Survey on Mixture of Experts
https://arxiv.org/abs/2407.06204

Disclaimer: Pity Marvin Minsky didn’t describe these things already in his society of mind! It would make them easier to understand now…

Thanks for this; both his critique and the follow-up regarding what thought is and where it occurs in the brain are – in my opinion – well reasoned and interesting. I had never heard of Sowa before.

On my reading, he’s just saying that language models by definition explicitly describe syntax, semantics and ontology as separate concerns, whilst GPT does not, and that therefore it is de facto not a language model. Corollaries include that it cannot tell the difference between truth and falsehood, it has no sense of “what there is or can be”, and it makes no separation between the form of a sentence and its substance. He gives several examples of practical issues this causes.

He says in that next slide that the “cognitive memory” – which is just a system for indexing and finding cognitive graphs – is now encoded as tensors. I suspect they may have used GPT embeddings for interoperability, but it’s just a better search interface, not a replacement for cognitive graphs. Meanwhile, Prolog is used to reason over the recalled graphs.

I think it works like this in spirit (I’m guessing a bit). You parse some texts, and use each sentence to look up graphs via a GPT embedding of the sentence from a vector database. Once all the graphs have been retrieved, you do some Prolog on the graphs to answer specific questions.

I’ve no idea how cognitive graphs could ever have enough coverage for this to be practical, but perhaps it is domain specific.

I’m guessing, but I could see how you could use GPT embeddings to semantically index CGs. CGs related to some text of interest would then be easy to retrieve – kind of like a CG RAG app. Prolog can be used on the resulting graphs.

I know how it works! The answer is, “not too well” :wink:

This is a joke but also true.

Yes. When ChatGPT became publicly available, I was excited beyond reason. Finally, I thought, we have empirical proof that there is nothing so special about our brains and our abilities as humans. Finally the Turing Test has been passed. “Not so fast,” seems to be the answer of many who should know better.

Take the Chinese Room Argument; in this thread it has been indirectly referenced several times. I will not cover it in the detail it deserves. I want to point to one curious sentence, a quote by Searle from 2010, in the Overview of the article linked above:

But such a specification necessarily leaves out the biologically specific powers of the brain to cause cognitive processes.

(emphasis mine)

This is mysticism :slight_smile:

My biggest concern is how we managed to get here and what to do next. Already a decade ago (or more?) I gave up on trying to follow the AI field. Too much secrecy, and even without it, independently validating and reproducing the claims of different teams is technically (and financially) not possible. Oversight is impossible, too. This power has already been weaponized and it can only get worse. The outlook is bleak :smiley: