Is there an easy way to view "stack traces" with the interpreter?

In some cases, probably yes. As you are doing ILP, you can optimize the generated clauses, no? If possible, when using tabling, make the recursive call first if that immediately triggers a variant goal. That has two advantages: it reduces the number of tables and it provides a small captured continuation, saving space and time.
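A minimal sketch of that advice in SWI-Prolog (the predicate names `path/2` and `edge/2` are invented for illustration): the recursive clause's *first* goal is a variant of the original call, so it is resolved against the table immediately.

```prolog
:- table path/2.

% Base clause first, recursive clause after it.
path(X, Y) :- edge(X, Y).
% The first goal, path(X, Z), is a variant of the calling goal
% path(X, Y), so tabling suspends it right away with only the
% small continuation "edge(Z, Y)" captured.
path(X, Y) :- path(X, Z), edge(Z, Y).

edge(a, b).
edge(b, c).
```

With this ordering, `?- path(a, c).` succeeds and `?- path(c, a).` fails finitely instead of looping.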

Thanks, that’s a great tip. I’ll see if it can reduce the tabling overhead in my systems.

Well, I can inspect and modify a clause before it is added to a program, so yes, in that sense I can optimise. I usually refrain from it because that’d be done smack in the middle of a critical path of the learning system but if it ends up saving space then it’s an important optimisation. Without certain heuristics I have managed to bring a 256GB RAM server to its knees just with exploding tables.

Yes, but if you save a lot of runtime, you win anyway. Applying the heuristic to tabled clauses to try and get the recursive call as early as possible is probably worthwhile. Certainly so if your clauses are pure, so that swapping goals in the body, as well as swapping clauses, has no semantic consequences. For Prolog in general, that is harder.

Yeah, the programs my systems learn are Datalog in fact, but in any case they have no asserts, retracts, cuts and so on, so changing the order of clauses won’t change their semantics.

But I have confused myself and you with me. I didn’t mean I wanted to optimise the programs learned by my system. What I meant was I want to optimise the learning system itself.

The learning system includes some predicates that are tabled but there’s a flag to set tabling on or off, exactly because tabling takes up so much RAM. There’s an alternative way to avoid infinite left-recursions (depth limiting) and in some cases it’s better to use that, instead of tabling. In such cases I turn the tabling flag off.

So what I wanted to do is check the tabling flag and when it’s on, re-write my learning system’s code in the dynamic database so that it’s ordered like you say, with recursive clauses after the base clauses. Or I could assume the tabled version as default and re-order when the flag is off, it doesn’t really matter.
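A hedged sketch of what such a reordering step could look like; every name here (`reorder_for_tabling/1` and the test predicate) is invented for illustration, and it assumes the target predicate is dynamic, as the clauses of a learning system living in the dynamic database would be:

```prolog
% Reorder the clauses of a dynamic predicate so that recursive
% clauses (those whose body calls the predicate itself) come after
% the base clauses. Illustrative sketch only.
reorder_for_tabling(Pred/Arity) :-
    functor(Head, Pred, Arity),
    findall(Head-Body, clause(Head, Body), Clauses),
    partition(recursive_clause(Pred), Clauses, Rec, Base),
    retractall(Head),                              % requires dynamic Pred/Arity
    forall(member(H-B, Base), assertz((H :- B))),  % base clauses first
    forall(member(H-B, Rec),  assertz((H :- B))).  % recursive clauses last

recursive_clause(Pred, _Head-Body) :-
    body_calls(Body, Pred).

% Walk a clause body looking for a call to Pred.
body_calls((G, Gs), P) :- !, ( body_calls(G, P) ; body_calls(Gs, P) ).
body_calls(G, P) :- callable(G), functor(G, P, _).
```

The inverse (recursive clauses first) is the same sketch with the two `forall/2` calls swapped, so either ordering can be picked based on the tabling flag.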

I don’t generally need to table the learned programs because they’re not particularly expensive to run once they’re learned. But the learning can be expensive unless recursion is kept in check. The learning system is a meta-meta-interpreter (i.e. a second-order Prolog interpreter written in first-order Prolog) and it learns by an SLD-Refutation proof of the training examples. Where it gets expensive is that it can easily attempt to construct a non-terminating recursion and prove it. This is where the tabling comes in.

In particular, what I table is the meta-meta-interpreter itself. I think this is a bit peculiar as a use of tabling. The alternative is to write my own SLG-Resolution meta-interpreter but it seems like that’d take me a couple of years and SWI-Prolog’s tabling seems mature, so I just table my meta-interpreter and let you do all the hard work :stuck_out_tongue:
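For readers following along, here is a minimal sketch of the general idea of tabling a meta-interpreter (this is *not* the poster's actual system; the `prove/1` and `rule/2` encoding is invented). The object program is the left-recursive ancestor definition that would loop under plain SLD resolution, yet the tabled interpreter terminates on it:

```prolog
% Vanilla meta-interpreter over an object program stored as
% rule(Head, Body) facts; the interpreter itself is tabled.
:- table prove/1.

prove(true).
prove((A, B)) :- prove(A), prove(B).
prove(A) :- rule(A, Body), prove(Body).

% Object program: left-recursive, non-terminating under plain SLD.
rule(anc(X, Z), (anc(X, Y), parent(Y, Z))).
rule(anc(X, Z), parent(X, Z)).
rule(parent(tom, bob), true).
rule(parent(bob, ann), true).
```

Here `?- prove(anc(tom, ann)).` succeeds and `?- prove(anc(ann, tom)).` fails finitely, because the variant call to `prove(anc(...))` is suspended against its table instead of recursing forever.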

Can’t you get away with time and/or depth limit for your case? I.e., simply reject non-terminating programs and let the ILP system figure out terminating alternatives?

Yes, I do use depth limits as an alternative to tabling, but that requires the user to set the depth limit, since it’s not always going to be the same; in fact it can vary wildly depending on the program we’re trying to learn. If the depth limit is too large the learned program doesn’t change, but it takes longer to learn it.
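For reference, the depth-limiting alternative can be wrapped around a goal with SWI-Prolog's `call_with_depth_limit/3` (the wrapper name `safe_prove/2` and the `loop/1` example are made up here):

```prolog
% Result is the deepest depth actually used on success, or the atom
% depth_limit_exceeded when the limit is hit before a proof is found.
safe_prove(Goal, Limit) :-
    call_with_depth_limit(Goal, Limit, Result),
    Result \== depth_limit_exceeded.

% A predicate that recurses forever under plain SLD resolution:
loop(X) :- loop(X).
```

With this, `?- safe_prove(loop(a), 1000).` fails finitely instead of looping, at the price of having to guess a good value for `Limit`.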

Time limits are much harder to estimate beforehand: if there’s a solution we can’t know how long it will take to reach it, but we know we want to reach it. So I don’t use time limits.

See, what’s going on here is that ILP has advanced to the point that it can learn arbitrary logic programs with recursion at any depth and between any number of clauses with any number of literals or arguments. Add predicate invention to that and we can basically learn predicates with arbitrary invented auxiliaries, nested arbitrarily deep and calling each other and the main predicate any which way, including in mutual recursion. In other words the learned programs can be as complex as system resources will allow. So the problem is to reduce the use of system resources without limiting the complexity of the learned programs. Traditional depth and time limits aren’t great for that because they cut off everything automatically and there’s no way to know how close or far we are from reaching the goal, at that point. Tabling is so much better because it doesn’t rely on “magic” cutoff points. But like you said it trades off time for space and that can make the RAM go boom.

As a historical note, earlier ILP, based on inverse entailment (like in Aleph, or Progol) was based on a principled sort of depth limiting (bottom clauses) but that still turned out to be incomplete and we couldn’t learn the kind of recursive programs we can learn now. See:

I understand. I’d expect recursion to happen over recursive structures in the input you are trying to learn, no? If that is true, there is a simple upper limit: the number of “tokens” in the input times the longest recursive path through the call graph. In most cases unbounded recursion gets deeper very quickly, so setting the depth limit relatively high should not matter too much.

But, there are additional advantages to tabling such as performance that is less dependent on goal and clause ordering.

For curiosity, do people use LLMs to generate and debug candidate programs in the context of ILP these days?

In short, yes, but we can’t know the call graph beforehand. The learning procedure is really just an SLD-Refutation, so we can’t know whether it will end in an empty goal clause, a non-empty goal, or not end at all, until the proof terminates… or doesn’t. If we could know that without carrying out the proof, well, then, I guess we wouldn’t need to carry out the proof in the first place. In any case, any approach that can limit the call graph in an SLD-Refutation would also work in ILP (or at least in Meta-Interpretive Learning, the specific approach I’m studying), but there’s no such thing.

What Inverse Entailment did with Bottom Clauses is that it tried to calculate exactly that kind of depth bound: the number of steps it takes to “walk over” the “background knowledge” (that’s a Prolog program that is given as input along with training examples usually in the form of ground facts). What I, personally, have learned from that period of ILP is to be very careful about trying to apply arbitrary limits, or even principled ones. This is not minimax with alpha-beta cutoff. We’re trying to prove a program that we don’t yet have, so we are in a state of very high uncertainty. It’s better to let Resolution do its thing (which comes down to eliminating uncertainty, if you think about it very hard) without intervening.

Resolution is hard core. I think it’s greatly underestimated. Did you know it’s also the basis of one of the two dominant branches of SAT-solving algorithms? That’s the Conflict-Driven Clause Learning approach, itself based on the older Davis–Putnam approach that was almost pure resolution (although in the propositional case, not first-order).

I haven’t seen that, specifically, I don’t think. I know people generate Prolog with LLMs and say that they do ILP, though. I’m not sure how to react to that, to be honest. ILP essentially started as an attempt to “invert deduction” as a way to achieve induction. This is an old idea that probably goes all the way back to Aristotle and, IIUC, was popularised by William Stanley Jevons (the same Jevons as in the Jevons Paradox and Jevons’ Number). Obviously LLMs don’t work that way.

Inverting deduction in a logic programming setting was, I believe, first proposed by Gordon Plotkin in his doctoral thesis in 1972 (https://era.ed.ac.uk/items/fa0309ce-00e7-4373-b39e-0d9ce45f0b21), just a few years after Robinson introduced resolution. In the introduction to his thesis Plotkin describes a conversation with R. J. Popplestone (of Poplog fame) in which Popplestone remarked “that, just as the unification algorithm was fundamental to deduction, so might a converse be of use in induction.” In the context, “deduction” essentially meant Robinson’s recently-introduced Resolution. This motivated Plotkin to invert unification. Plotkin didn’t call his work “ILP”. The same line of investigation was picked up by Stephen Muggleton in the 1990s, who proposed Inverse Resolution and named the field in 1991. Inverse Resolution was followed by Inverse Implication and then Inverse Entailment. All those were essentially attempts to make Prolog, or in any case Resolution-based deductive logic programming, work in an inductive manner.

What that means in practice: in ordinary logic programming we have a set of definite clauses, our logic program, and we prove a Horn goal (a Horn clause without a head). To make Prolog inductive we also start with a logic program, call it B, that we call the “background knowledge” in ILP, but this time we have a set of Horn goals, our positive examples, or E+, that cannot be proved with B. So we need to extend B to prove those goals. If we can do that successfully, we end up with a new set of definite clauses, call it H (for “hypothesis”), that, when added to B, proves E+ (more precisely, B and H entail E+). In the same way, if we have negative examples E-, we want H added to B to not prove E-.
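A toy instance of that setting, with invented predicates, may make the roles of B, E+, E- and H concrete:

```prolog
% Background knowledge B:
parent(ann, bob).
parent(bob, carl).

% Positive examples E+ (not provable from B alone):
%   anc(ann, bob).   anc(ann, carl).
% Negative examples E-:
%   anc(carl, ann).

% A hypothesis H that an ILP system might return. B plus H entails
% E+ and does not entail E-:
anc(X, Y) :- parent(X, Y).
anc(X, Y) :- parent(X, Z), anc(Z, Y).
```

Loading B and H together, `?- anc(ann, carl).` succeeds while `?- anc(carl, ann).` fails, which is exactly the entailment condition stated above.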

All that is quite different from generating Prolog code with an LLM, although it can achieve the same goal, if the goal is to automatically generate Prolog. What’s missing though is the ability to include “background knowledge” and to make sure that the generated code is consistent with it. LLMs can’t really do that very well at all.

And in any case, if you’re going to use an LLM to generate code, then why generate Prolog, an obscure language that very few people know or use? LLMs can generate Python, JavaScript, C#, Java, etc., just as well and probably better than Prolog. There’s very little incentive, AFAICT, to do “ILP with LLMs” then.

What about higher-order hypotheses? For example, the notion “left recursive” (L) could be a hypothesis, in the sense that somebody hypothesizes that this and that relation (R) is L:

L(X, Y) :- R(X, Y).
L(X, Z) :- L(X, Y), R(Y, Z).

Unfortunately you can do that also with knowledge graphs, right? Define a subclass L of class R, with some attributes that make it left recursive. Like saying, more concisely:

L ⊇ R ∪ (L ∘ R)

So when your knowledge graph can constrain algebraically, like class hierarchies that get specialized by equations, you could express the same. Such a hypothesis class can act as a stamp for a class instance, and thus generate a program. It seems to me LLMs can do that quite well. And ChatGPT is scolding me when I ask it to blow up parent/2 into anc/2 via left recursion; it wants me to use the right-recursive version:

The correct way (right-recursive)

anc(X, Z) :- parent(X, Z).
anc(X, Z) :- parent(X, Y), anc(Y, Z).

It tells me my left recursion doesn’t terminate in ordinary Prolog. We are definitely doomed. I am now teaching ChatGPT about tabling and how it deals best with left recursion; maybe ChatGPT now goes into another abstract hypothesis space.

That’s how Meta-Interpretive Learning works. Those “higher-order hypotheses” (really, second-order definite clauses) are added to the “background knowledge” and used to prove examples by resolution. During the resolution proof, by unification, the second-order variables in those second-order definite clauses are bound to predicate symbols. If the proof completes successfully, a first-order hypothesis, H, is constructed by applying the resulting substitutions of the second-order variables back to the second-order clauses. So in your example L gets bound to “anc”, and R to “parent”, etc. First-order substitutions are discarded, unless we want to learn a program with constants, in which case there’s a special notation to do that.
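A heavily simplified sketch of that binding process. This is an invented encoding, not Metagol's or Louise's actual one; atoms are written as lists `[Pred | Args]` so a predicate position can hold a variable, which stands in for the second-order variables, and body atoms here resolve only against the background knowledge (real MIL systems also resolve them against the hypothesis, which is how recursion is learned):

```prolog
% Metarules: identity P(X,Y) <- Q(X,Y), chain P(X,Y) <- Q(X,Z), R(Z,Y).
metarule(identity, [P, X, Y], [[Q, X, Y]]).
metarule(chain,    [P, X, Y], [[Q, X, Z], [R, Z, Y]]).

% Background knowledge:
bk(parent(ann, bob)).
bk(parent(bob, carl)).

% Prove one example by instantiating a metarule against the BK, then
% apply only the predicate-symbol bindings (the metasubstitution) to
% a fresh copy of the metarule, discarding first-order substitutions.
learn_clause(Example, (Head2 :- Body2)) :-
    metarule(Name, Head, Body),
    copy_term(metarule(Name, Head, Body), metarule(Name, Head2, Body2)),
    Head = Example,                                  % binds P and head args
    maplist(prove_bk, Body),                         % binds Q, R, ... via BK
    maplist(copy_pred, [Head | Body], [Head2 | Body2]).

prove_bk([Q | Args]) :- Atom =.. [Q | Args], bk(Atom).

% Share only the predicate symbol between the two copies.
copy_pred([P | _], [P | _]).
```

Proving the example `[anc, ann, carl]` instantiates the chain metarule and returns a clause representing `anc(X, Y) :- parent(X, Z), parent(Z, Y)`, with the first-order arguments back to being variables.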

The trick is to not have to define second-order definite clauses that match the program you’re trying to learn exactly, because then you’re just coding the answer by hand. There’s some work on that, particularly by myself, e.g. here:

I don’t think LLMs can do that very well. The example you give, of ChatGPT sticking to arbitrary rules that have nothing to do with the task at hand, is typical, but in general LLMs don’t do deductive reasoning, let alone resolution, so they can’t really tell you what the logical consequences of a logic program are, or do all the simple but powerful things that resolution with unification can do.

Neither can ILP, unless you use a resolution theorem prover that diverges considerably from the usual Prolog SLD(NF) resolution. Because this is a left-recursive example, which Prolog SLD(NF) resolution cannot handle, you cannot verify positive facts, nor can you verify negative facts, because the hypothesis space doesn’t match a decidable fragment of Prolog SLD(NF).

So are you saying there are ILP systems that basically implement their own tabling, so to speak? And if yes, how do they do it? With delimited continuations, or has ILP in its long history from Plotkin to now evolved to use some other methods for resolution theorem proving? I'm just curious how some technique from ILP could relate to the OP’s inferencing problem and whether there would be something interesting out of the box. With ordinary Prolog SLD(NF) resolution and a left-recursive hypothesis:

anc(X, Z) :- anc(X, Y), parent(Y, Z).
anc(X, Z) :- parent(X, Z).

You simply get a crash or infinite loop for queries:

?- anc(carlo, david).
/* crash or infinite loop */

?- \+ anc(carlo, david).
/* crash or infinite loop */
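For completeness, under SWI-Prolog's tabling the same left-recursive hypothesis does terminate; a sketch with made-up parent/2 facts chosen to match the queries above:

```prolog
:- table anc/2.

% The left-recursive hypothesis, unchanged:
anc(X, Z) :- anc(X, Y), parent(Y, Z).
anc(X, Z) :- parent(X, Z).

parent(carlo, david).
```

Now `?- anc(carlo, david).` succeeds and `?- \+ anc(david, carlo).` also succeeds, because the inner left-recursive call is suspended against its answer table instead of looping.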

So while prima facie ILP addresses machine learning, it also has to deal with inferencing issues. Given that definite clauses are Turing complete, there is no decidability in general of positive facts, nor of negative facts, nor decidability of logical equivalence of hypotheses either. You might restrict to definite clauses with the Datalog restriction, i.e. no function symbols; then the situation looks a little better. But you nowhere wrote Datalog, you always say definite clauses, which is a term from the literature and means no negation. The term “definite clauses” itself does not necessarily imply no function symbols. If you would stipulate no function symbols, you could attack it with SLD resolution and its even simpler tabling extensions, without going into some three-valued semantics as, for example, SWI-Prolog's tabling for SLDNF does. But the ILP community might use “definite clauses” with another meaning?

BTW: Alain Colmerauer himself was keen to remind us that definite clauses are Turing complete, by constructing a Turing machine in pure Prolog. He even did curious things in 2008:

Back to the complexity of universal programs
Alain Colmerauer - 2008
http://alain.colmerauer.free.fr/alcol/ArchivesPublications/SydneyUniversal2008/Colmerauerl2008.pdf

The Wikipedia code is nonsense, since it uses once/1 and !/0; you need to be more careful to map a TM to definite clauses. With the function symbols ('.')/2 and s/1 you can do lists and Peano numbers, respectively.

You generate a program. Next, you compute the call graph for that program and you find the longest cycle as well as the longest non-cyclic call chain. Now, for each test you set the depth limit to InputTokens * CycleLength + CallChainLength + SomeSafety.

Of course, this is overhead, but I think pretty small compared to all the work you are doing in an ILP iteration anyway.

Why not? You tell it “here is a background program, here are positive and here are negative examples, can you extend the background program to cover this test data as well as possible?”

That is not so clear. I once asked ChatGPT to do some formalization of natural language and it spontaneously suggested using Prolog for that. That could of course be because I had chatted about Prolog before. On the other hand, I mostly used it for C (used, as I have since switched to Claude) and only some tiny experiments with Prolog. Besides, a classification algorithm in Horn clauses is much easier to understand than most imperative code doing the same. If you turn this into an agent where the LLM can run the programs it generates, I’d expect it to behave way better than the Aleph state of the art (see below).

P.s. Thanks for summarizing the state. I did some work with Steve Moyle when he was doing his PhD on Aleph in Oxford. That is about what I know about ILP. Good to see how things progressed.


You can add background knowledge to an LLM session by prompt engineering. Just start your LLM session with a few assertion statements. The session “context” is a big “background knowledgebase”. An LLM is nothing else than an ILP system with an NLP interface. What is missing from an ordinary ILP system is the huge NLP vocabulary and grammar, as well as the huge space of hypotheses and patterns an LLM can access, and nowadays, through RAG, the possibility to even dynamically extend the context during a session by including up-to-date resources from the internet, solving the cut-off date problem of an LLM. The huge amount of resources available to an online LLM, and even compressed into a local LLM through some neural network technology and run on an AI laptop, makes it very hard for Prolog to compete with LLMs. But Prolog might learn how to do it? Who knows? It's never too late. Assuming there will be or there is a technology flow from LLM to Prolog, performing some late adoption would be nothing bad. Disagreeing with my equation LLM = (NLP + ILP) * HUGE is also ok.

Isn’t that just SLD with a depth limit somehow? Are you still in the domain of Datalog, i.e. definite clauses without function symbols? Some people count the variables in the rules, and then get the combined complexity:

|P| * |D| ^ |V|

where |P| is the size of the program, |D| the size of the data domain and |V| the maximum number of variables per rule. This is not a depth limit but a time limit. The problem is that the depth limit shows PSPACE, which then often translates exponentially into some time limit, i.e. hence EXPTIME.

But I am a little bit rusty. LLMs probably don’t try that; I don’t know what heuristics they have in stock, but they probably won’t do it, because of tighter time limits and since the requirement is an interactive session, and not some offline batch learning. If you fix the Datalog program P, paradoxically the complexity looks different. But an ILP system tries to deal with different hypotheses, so I guess the Datalog program P also changes, right?

See also:

Complexity and Expressive Power of Datalog
https://www.cs.ox.ac.uk/files/1018/gglecture7.pdf

That was not my intent. It is intended as SLD with a protection that kills non-terminating SLD inference (assuming you check that the depth limit has been exceeded and you reject the program in that case).

Time limit is complicated. call_with_inference_limit/3 is better. I guess it is still hard in ILP context though, unless you state that programs above a certain complexity are not acceptable. Non-terminating programs are always unacceptable :slight_smile:
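A sketch of how `call_with_inference_limit/3` could wrap a candidate test in SWI-Prolog (the wrapper name `try_candidate/2` and the looping predicate are invented); the bound is counted in logical inferences rather than wall-clock time, so it is reproducible across machines:

```prolog
% Result is true or ! on success, or inference_limit_exceeded when
% the goal used up its inference budget before proving anything.
try_candidate(Goal, Limit) :-
    call_with_inference_limit(Goal, Limit, Result),
    Result \== inference_limit_exceeded.

% A candidate "program" that never terminates:
run_forever :- run_forever.
```

With this, `?- try_candidate(run_forever, 100000).` fails finitely, which is one way to reject programs above a chosen complexity.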

I consider that a dubious statement. Prolog inference is logically correct (within its limitations). With LLMs you never know. It is like asking an LLM to do arithmetic. For simple problems this actually works, for more complicated expressions it is between a little wrong and really very wrong. What an LLM can do is translate an NLP problem into an arithmetic expression and use a calculator. In the same sense, an LLM can generate a Prolog program. Actually, people are doing that :slight_smile:


Hi all — first post here, thanks for having me :slight_smile:

@Sean, I had exactly the same need recently and ended up building prolog-trace-viz (https://www.npmjs.com/package/prolog-trace-viz) for it. Might be worth a look.

Prolog has been a bit of a rediscovery for me lately — genuinely loving it. Happy to answer any questions about the tool.


Yes, but how? The reason I want to avoid infinite recursion is so that I can “generate” a program to begin with.

“Generate” is really not the right terminology. You can generate a program by starting with an empty clause and then adding one literal at a time to it, until the program passes some kind of test that says you have the right program. This has been tried exhaustively in ILP over 30 years and it doesn’t work.

One reason is efficiency: a logic program is a set of clauses and a clause is a set of literals, so if you generate all possible logic programs you are generating the set of sets of sets of literals: a powerset of a powerset. Obviously that blows up immediately. You can try to reduce the combinatorial explosion by being clever with heuristics and pruning and whatnot, but at the end of the day you’re effectively searching a very large haystack (the set of all programs) for a tiny needle (the program you want). In fact, the larger the set of programs you can generate, the greater the probability of error, because there are many programs that look like the one you want, but aren’t exactly that. Then you need many, many examples to try and distinguish between similar-looking programs.

The alternative I’m discussing, and the approach I’m championing if you will, i.e. Meta-Interpretive Learning, instead casts the learning problem as an SLD-Resolution proof. This has many advantages, including efficiency (obviously you don’t need to search a powerset of a powerset to carry out an SLD-Resolution proof). But, exactly because it’s SLD-Resolution, you have to be careful to avoid infinite recursions.

But those infinite recursions happen during the SLD-Resolution proof that gives you the program you’re looking for. So you have to do whatever you do to avoid infinite recursion before you have a program, during the learning phase.

Well the reason is as you say below:

LLMs don’t do proofs by Resolution. At best they can approximate Resolution. But I guess I need to give you an example of what I mean by that. Stay tuned, it’s lunch time :slight_smile:


Yeah, my intended playbook looks like sleeping with the enemy.

Why? I didn’t say you have to approximately learn the compression, resulting in a lossy compression. I also admit the construction of compressions, in particular a lossless compression. Compression is a general technique that applies to various forms of inferencing and deduction. It's relatively easy to turn a Horn clause exactly into an artificial neural network (ANN) with ReLU, at least in the boolean case. Just notice how NOT and AND can be mapped:

NOT(A) = 1 - A
AND(A_1, ..., A_n) = ReLU(A_1 + ... + A_n - n + 1)

Now that you have AND and NOT you can model any boolean circuit. Such results were already published 50 years ago, using another activation function than ReLU, like some sigmoid. ReLU has become popular recently, and is offered by most AI accelerators.
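The two mappings can be checked directly; a small sketch in Prolog over 0/1 truth values (the gate predicate names are invented):

```prolog
% ReLU over integers: max(0, X).
relu(X, Y) :- Y is max(0, X).

% NOT(A) = 1 - A
not_gate(A, Y) :- Y is 1 - A.

% AND(A_1, ..., A_n) = ReLU(A_1 + ... + A_n - n + 1): the sum reaches
% n, and hence the ReLU output 1, only when every input is 1.
and_gate(As, Y) :-
    sum_list(As, S),
    length(As, N),
    relu(S - N + 1, Y).
```

For instance `and_gate([1, 1, 1], Y)` gives `Y = 1` while `and_gate([1, 0, 1], Y)` gives `Y = 0`, matching the boolean truth tables.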

BTW: Take these Horn clauses for trait/4 for example:

You could easily compress them and then run them on an AI accelerator, meaning you would depart from the sequential top-down execution and tap into a certain parallel bottom-up execution. But it needs a numerical encoding. For single-subject formulas like the trait example above, you already have parameter codes like p000053, p000133, p000106, etc. But you would need to scale it somehow, since an AI accelerator can only accelerate a small matrix and doesn’t compute a transitive closure per se. Frankly, I don’t have a full playbook for how to do it. A playbook would also need to say how to reinterpret the ANN activations back.