Reliable Reasoning Beyond Natural Language
To address this, we propose a neurosymbolic approach that prompts LLMs to extract and encode all relevant information from a problem statement as logical code statements, and then use a logic programming language (Prolog) to conduct the iterative computations of explicit deductive reasoning.
[2407.11373] Reliable Reasoning Beyond Natural Language
A couple of years ago there were a number of papers that used the same approach for planning: they basically translated some natural language instructions to PDDL (the Planning Domain Definition Language, used by all mainstream planners) or, alternatively, Python, then they passed the result to a planner or to a robot’s API.
For example, see the following preprint:
Despite the claims in that paper, subsequent work showed severe problems with the approach. See the following for a review of planning using LLMs:
Since the paper you cite follows the same approach, except that it translates reasoning problems to definite programs rather than PDDL and hands them off to a Prolog interpreter rather than a planner, I anticipate the same failure modes as with the earlier work, which btw looked like it worked until some experts on planning had a look and pointed out the pressure points that cause the whole effort to collapse.
The problem in general is that LLMs cannot be relied upon to do the translation to a formal language accurately, unless they’ve already seen an accurate translation of what they are asked to translate. Translating natural language to a formal language itself requires decision-making that implies understanding of both languages and of the domain of discourse, and such understanding is absent from LLMs in novel domains. In other words, the proposed approach might yield a decent Prolog boilerplate generator, but it will be brittle and easy to break with simple techniques (like changing the names of symbols, as in the obfuscated blocksworld domain used to demonstrate the brittleness of LLMs-as-planners).
Yes, I’ve been experimenting with this approach – use the LLM (GPT-4 and Sonnet 3.5 tested) to map English to Prolog, query, then map back from Prolog to English. It takes a bit of badgering to get the LLM to generate complete, runnable code rather than merely sketch a possible Prolog representation, but the method does work.
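To make that concrete, here is a minimal illustration (my own toy example, not output from either model) of the kind of complete, runnable Prolog one wants the LLM to emit, together with the query the host program would run before the binding is verbalised back into English:

```prolog
% English input: "Alice is Bob's parent. Bob is Carol's parent.
% A grandparent is a parent of a parent. Who is Carol's grandparent?"

parent(alice, bob).
parent(bob, carol).

grandparent(GP, GC) :-
    parent(GP, P),
    parent(P, GC).

% Query run by the host program; the answer binding is then handed back
% to the LLM to be phrased in English:
% ?- grandparent(Who, carol).
% Who = alice.
```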
Tangentially, we’re finding that the LLMs are quite good at transpiling, even across programming language paradigms. So one can prototype in Prolog, then – if engineering objects to running Prolog in production – transpile to some other programming language, e.g. TypeScript.
A little bit has changed since August 2024, when I made the post about long inferencing. There is now some terminology in place, like CoT, DTG and RAG:
Plus, China has not only DeepSeek; there are many more, like Yi-Lightning, Alibaba Qwen, etc. Only DeepSeek is currently making the most waves, so that even my neighbour has heard of it.
I asked some random code generator AIs to create code for the Tower of Hanoi problem, but with a variable number of towers. Without further prompt engineering, all of them built a variant with n towers, but the code only used 3 of them and implemented only the standard algorithm. With prompt engineering they use the other towers, but the solutions are wrong. At least ChatGPT can explain why current code generator AIs are not able to solve that problem, when specifically asked.
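For reference, the standard 3-peg recursion that the generators keep falling back on looks roughly like this in Prolog (a sketch of the textbook algorithm, not any of the generated code); using the extra pegs efficiently requires a genuinely different strategy, which is where the generated solutions go wrong:

```prolog
% Standard 3-peg Tower of Hanoi: move N disks from From to To, using Via
% as the spare peg, producing the list of moves.
hanoi(0, _From, _To, _Via, []) :- !.
hanoi(N, From, To, Via, Moves) :-
    N > 0,
    M is N - 1,
    hanoi(M, From, Via, To, Before),
    hanoi(M, Via, To, From, After),
    append(Before, [move(From, To)|After], Moves).

% ?- hanoi(3, left, right, middle, Moves).
% Moves = [move(left,right), move(left,middle), move(right,middle),
%          move(left,right), move(middle,left), move(middle,right),
%          move(left,right)].
```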
Of course it can be that specialized AIs are able to solve planning problems, maybe comparable to AlphaGo or AlphaCode from DeepMind. But I think the real solution comes from solvers. What AIs maybe should learn is how to program solvers. But that also means that some NP-hard problems will, in the end, still be NP-hard.
Yes, because we only know how to program solvers for NP-complete problems, and AIs (i.e. LLMs nowadays) only know how to code what we know how to code.
The day an automated system discovers a polynomial-time decision algorithm for an NP-complete problem is the day when I, at least, will accept that we finally have something that deserves to be called artificial intelligence. But that day is not even on the horizon, and whatever system manages to do that, it’s not going to be a language model, large or small.
There are many interesting problems that don’t involve classes such as NP, but rather complexity classes that involve oracles:
Oracle machine
In complexity theory and computability theory, an oracle machine is an abstract machine used to study decision problems. It can be visualized as a Turing machine with a black box, called an oracle, which is able to solve certain problems in a single operation.
https://en.wikipedia.org/wiki/Oracle_machine
Typical oracles are the end-user or the internet. On the other hand, from the viewpoint of a Prolog system, a conversational agent could be the oracle. Can the whole be more than the sum of its parts?
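A minimal sketch of that idea, assuming nothing about any particular LLM API: any goal the local knowledge base cannot decide is deferred to an external oracle, here stubbed by simply asking the end-user at the terminal (one would replace the read/1 call with a call to the conversational agent):

```prolog
:- dynamic known/2.

% oracle_call(+Question): succeed if the oracle affirms Question.
% Answers are cached so each question is asked at most once.
oracle_call(Question) :-
    known(Question, Answer), !,
    Answer == yes.
oracle_call(Question) :-
    format("Oracle, is it true that ~w? (yes/no) ", [Question]),
    read(Answer),
    assertz(known(Question, Answer)),
    Answer == yes.

% Example rule that defers one condition to the oracle (hypothetical):
% good_gift(X) :- toy(X), oracle_call(likes(child, X)).
```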
Edit 29.01.2025
I find it quite fascinating that there are now the twin brothers ChatGPT and DeepSeek. I had posted a ChatGPT and DeepSeek interaction that somehow showed that they are not copies of each other. But I took it down, because I had a typo in it and wrote DeekSeek instead of DeepSeek. But for benchmarking conversational agents, one could use a setup where a Prolog system takes the output of ChatGPT and feeds it into DeepSeek, and vice versa, and performs some measurement, resulting in a kind of simultaneous exhibition.
Maybe one can fuzz-test the safety, etc. of a conversational agent that way?
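A rough sketch of such a setup, with ask_chatgpt/2 and ask_deepseek/2 as hypothetical placeholders for whatever HTTP client one wires up to the two APIs; the Prolog host just shuttles the replies back and forth and keeps the transcript for later measurement:

```prolog
% exhibition(+Prompt, +Rounds, -Transcript): ping-pong between the two
% agents for a fixed number of rounds, collecting the exchanged replies.
exhibition(_Prompt, 0, []) :- !.
exhibition(Prompt, Rounds, [chatgpt(A)-deepseek(B)|Rest]) :-
    Rounds > 0,
    ask_chatgpt(Prompt, A),   % hypothetical: prompt -> ChatGPT reply
    ask_deepseek(A, B),       % hypothetical: reply -> DeepSeek reply
    R is Rounds - 1,
    exhibition(B, R, Rest).
```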