The problem with this is that you have no guarantee that your agent setup will ever generate a correct program. What you’re proposing is basically a generate-and-test approach, but generate-and-test is only as good as its generator. If your generator can’t generate the programs you want, it doesn’t matter how you do the testing, you just won’t get what you want.
Aleph is no longer state-of-the-art for ILP and it hasn’t been for about 10 years now, but because ILP is a very small field and most people outside it don’t keep up with developments, I often have this conversation. For many people Aleph is the same as ILP, or when they say “ILP” they really mean “Aleph”. That was incorrect even in the '90s, because there were other systems and other approaches besides Aleph from the beginning of ILP.
Nowadays things have advanced to the point where it’s unfair to compare a system’s performance to Aleph, because everyone (in ILP) understands there are things it can’t do that more modern systems can. For example, recursion and predicate invention are well beyond Aleph’s capabilities, so more recent approaches like Meta-Interpretive Learning and Learning From Failures (as in Popper) that can do those things automatically will always “win” in comparison. Unfortunately, reviewers at the mainstream AI conferences aren’t always aware of that and often ask for comparisons to Aleph, so authors end up comparing their systems to Aleph anyway. That’s a bit sad, because Aleph just can’t perform the same tasks as the systems it’s compared to, so it looks bad, and that’s unfair. It was state-of-the-art for its time. But we have better approaches now.
Really though if you want to know what happened in ILP in the last 30 years I have two papers to recommend. One is a retrospective of the first 20 years up to 2010, and the other one adds the next 10 years to date.
Well, this was published a few years ago, before the AI boom. It already had a section 7 Technologies - 7.2 Neural networks, but it says they are limited. But if you look at what is published in 2025…

Well, maybe not exactly my claim; I didn’t study in detail what the paper does. But there is a certain trend to use technology that somehow taps into AI accelerators and that is nevertheless logically sound. The abstract of the paper has this conclusion:

The experimental results show that our approach remarkably decreased its execution time for big data while maintaining accuracy.

Is this paper behind a paywall? I need to find an unrestricted ANN+ILP paper. Of course ANNs can only be deployed if the hypothesis space allows it, and there are a dozen criteria I am not aware of. It’s not a general method.
I promised an example of the way Meta-Interpretive Learning works so that we have something more concrete to talk about. The following is from my own Meta-Interpretive Learning System, Louise:
So what we have here essentially is a higher-order definite program that comes in two parts: one part is a set of first-order definite program clauses, which is basically a Prolog program. The other part is a set of second-order definite program clauses, which we can see as a second-order Prolog program. In practice the second-order program is datalog (no function symbols) but the first-order program is really an unrestricted Prolog program.
Unlike Prolog we also have two sets of training examples: positive and negative. The “:-” in the negative examples is just a reminder, they’re not meant to be executed as directives!
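Since the actual experiment file is not reproduced here, the following is a plain-Prolog sketch of what such training data could look like. The predicate and constant names are my own assumptions, and this is not Louise’s exact experiment-file format, which uses its own declarations:

```prolog
% First-order background knowledge: a small parent/2 relation.
parent(stathis, kostas).
parent(kostas, alexandra).

% Positive examples of the target predicate ancestor/2.
ancestor(stathis, kostas).
ancestor(kostas, alexandra).
ancestor(stathis, alexandra).

% Negative examples are written with a ":-" prefix, e.g.
%   :- ancestor(alexandra, stathis).
% They are kept in comments here so that loading this sketch does
% not execute them as directives.
```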
Now, with the above loaded in memory, we can ask Louise to learn a program:
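The learned program itself is not reproduced here, but a left-recursive ancestor/2 definition of the shape Louise learns has the following form. The parent/2 facts and variable names are my own toy data, and the definition is tabled here so that queries against it terminate, where plain SLD-Resolution would loop:

```prolog
:- table ancestor/2.

% Toy background data (assumed, not from the original post).
parent(alice, bob).
parent(bob, carol).

% A left-recursive hypothesis: the recursive clause resolves with
% its own head predicate first. Without tabling, a query like
% ancestor(alice, X) would recurse infinitely.
ancestor(X, Y) :- ancestor(X, Z), ancestor(Z, Y).
ancestor(X, Y) :- parent(X, Y).
```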
And as you see it learns a bog-standard ancestor/2 definition based on parent. It’s a left-recursive definition though. So what’s going on here?
The version of Louise I use is the one that is bundled as an example with Vanilla:
Vanilla is a Meta-Interpretive Learning engine. If you navigate to the src/vanilla.pl module you’ll see that the main exported predicate is called prove; and if you squint a bit and tune out the debug statements, you’ll start to discern the good, old, familiar structure of a “vanilla” Prolog meta-interpreter:
prove(true,_K,_MS,_Ss,_Os,Subs,Subs):-
        !
        ,debug(prove_steps,'Reached proof leaf.',[])
        ,debug(prove_metasubs,'Metasubs so-far: ~w',[Subs]).
prove((L,Ls),K,MS,Ss,Os,Subs,Acc):-
        debug(prove_steps,'Splitting proof at literals ~w -- ~w',[L,Ls])
        ,prove(L,K,MS,Ss,Os,Subs,Subs_)
        ,prove(Ls,K,MS,Ss,Os,Subs_,Acc).
prove((L),K,MS,Ss,Os,Subs,Acc):-
        L \= (_,_)
        ,L \= true
        ,debug(prove_steps,'Proving literal: ~w.',[L])
        ,clause(Os,L,K,MS,Ss,Subs,Subs_,Ls)
        ,debug(prove_metasubs,'New Metasubs: ~w',[Subs_])
        ,debug(prove_steps,'Proving body literals of clause: ~w',[L:-Ls])
        ,prove(Ls,K,MS,Ss,Os,Subs_,Acc).
That’s what’s doing the learning! It’s a Prolog meta-interpreter. It’s equipped with a few extra arguments, mainly for book-keeping and for depth-limiting. Taking the arguments one by one from the left:
The stack of literals (L,Ls) is the set of positive examples.
K is an upper limit on the number of substitutions in Subs, effectively limiting the number of clauses that can be added to the program being learned.
MS is a list of Second-Order Definite Clauses from the Second-Order Background Knowledge (we don’t write those to the dynamic database).
Ss is a set of predicate symbols, which includes the predicate symbols in the positive examples and the first-order background knowledge, and may include one or more invented predicate symbols (predicates that are not defined either in the background knowledge or the examples).
Os is a list of options.
Subs is an accumulator of substitutions of second-order variables in the Second-Order Background Knowledge.
The final argument is the “binder” variable that binds to the accumulator of second-order substitutions when the proof succeeds.
In the third clause of the prove/7 meta-interpreter, where you would expect to find a call to clause/2, there’s a call to a predicate clause/8 instead:
,clause(Os,L,K,MS,Ss,Subs,Subs_,Ls)
This is a variant of clause/2 that allows fetching the bodies of clauses not only from the first-order background knowledge, but also from two other sets of clauses:
a) the second-order clauses stored in MS, and,
b) the substitutions of the second-order variables in MS, stored in Subs.
When a clause, M, is fetched from MS, then L, the literal we are trying to prove, is bound to the head of M and then the body literals of M (with some of their variables now bound by unification) are placed on top of the meta-interpreter stack and the proof continues.
When a substitution S is fetched from Subs, first it is expanded to a full clause by unification with the corresponding second-order clause in MS, and then its head is unified with L and its body placed on the meta-interpreter stack for the proof to continue.
So what this meta-interpreter is doing is carrying out an SLD-Resolution proof of the positive examples with both first- and second-order definite program clauses, and also with the clauses of a program that is being learned, as it is being learned.
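To make the two fetch cases concrete, here is a heavily simplified toy version of this idea (my own sketch, not Vanilla’s actual clause/8; the encoding with m/3 literals and sub/2 metasubstitution terms is assumed): background clauses are called directly, while resolving with a second-order clause records a metasubstitution in the accumulator, whose length is bounded by K:

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% Toy first-order background knowledge.
parent(alice, bob).
parent(bob, carol).

% Second-order clauses with explicit metasubstitution terms:
% sub(Name, [P|Qs]) records which predicate symbols instantiate
% the second-order variables. Chain: P(X,Y) :- Q(X,Z), R(Z,Y).
metarule(sub(chain, [P,Q,R]), m(P,X,Y), (m(Q,X,Z), m(R,Z,Y))).
metarule(sub(identity, [P,Q]), m(P,X,Y), m(Q,X,Y)).

% prove(+Literals, +K, +Sig, +SubsIn, -SubsOut):
% resolve encoded literals m(P,X,Y) against background clauses or
% metarules; K bounds the number of metasubstitutions, i.e. the
% number of clauses in the learned program.
prove(true, _K, _Ss, Subs, Subs).
prove((L, Ls), K, Ss, Subs, Acc) :-
    prove(L, K, Ss, Subs, Subs1),
    prove(Ls, K, Ss, Subs1, Acc).
prove(m(P, X, Y), K, Ss, Subs, Acc) :-
    (   P == parent                      % fetch from first-order BK
    ->  parent(X, Y),
        Acc = Subs
    ;   metarule(Sub, m(P, X, Y), Body), % fetch a second-order clause
        Sub = sub(_Name, Ps),
        maplist(symbol(Ss), Ps),         % ground second-order variables
        bind_sub(Sub, K, Subs, Subs1),   % record the metasubstitution
        prove(Body, K, Ss, Subs1, Acc)
    ).

symbol(Ss, P) :- member(P, Ss).

% Re-use a recorded metasubstitution, or add a new one as long as
% the size bound K is not yet reached.
bind_sub(Sub, _K, Subs, Subs) :- member(Sub, Subs), !.
bind_sub(Sub, K, Subs, [Sub|Subs]) :- length(Subs, N), N < K.
```

A query like prove(m(ancestor, alice, carol), 2, [parent, ancestor], [], Subs) returns Subs = [sub(chain, [ancestor, parent, parent])], which decodes to the over-specialised, single-example clause ancestor(X,Y) :- parent(X,Z), parent(Z,Y); Louise refutes all positive examples to get a general program. Note also that, exactly as described next, this naive sketch can loop on backtracking when a metasubstitution makes a clause resolve with itself.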
There are two dangerous situations here that can cause infinite recursion:
a) A second-order definite clause in MS can infinitely resolve with itself.
b) A first-order definite clause expanded from a substitution in Subs can infinitely recurse with itself or another clause expanded from Subs.
To avoid the first danger, infinite self-recursion of second-order definite clauses, the argument K is used to limit the length of Subs. That happens in clause/8 and it stops infinite self-resolution.
To avoid the second danger, Vanilla uses two methods.
The first method is tabling. prove/7 can be tabled by switching a flag on or off, like I described above. When the flag is on, prove/7 will happily prove left-recursive programs like the ancestor/2 version I list above.
The second method is to prevent clause/8 from fetching clauses from the expanded substitutions in Subs. This is done with a flag in a configuration file. At that point we can also safely turn off tabling.
What happens if we set the flag that stops clauses being fetched from Subs, and turn tabling off? This is what happens:
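The actual output is elided above, but the learned definition has the shape of the standard right-recursive ancestor/2. Here is a runnable sketch with assumed parent/2 facts; no tabling is needed:

```prolog
% Assumed toy data, standing in for the elided experiment output.
parent(alice, bob).
parent(bob, carol).

% Right-recursive hypothesis: the recursive call comes after a
% parent/2 goal, so plain SLD-Resolution terminates on ground queries.
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
```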
So now Louise learns a recursive definition of ancestor/2, but not a left-recursive one. This is the combined effect of the depth-limiting provided by the argument K of prove/7 and the fact that the program we’re learning can’t resolve with itself until it’s fully learned.
OK. That’s a lot to take in and you have work to do, but I hope it helps you better understand how Meta-Interpretive Learning works and why I don’t think it’s possible to apply the call-graph method you suggested. I also hope it explains where the depth limits and tabling come in.
I haven’t explained the use of negative examples, and how predicate invention works, to keep things short…er. Ask me if you are curious to know, I’m happy to explain.
Oh and, I forgot to say: Happy Easter to those who celebrate it. I’m Greek so our Easter is later.
It depends on what you mean by “ILP”. The approach I discuss above, Meta-Interpretive Learning (abbreviated as MIL) is really a second-order form of SLD-Resolution. Note that I say “SLD” not “SLD(NF)”. No Negation As Failure! At least not in principle.
The nice thing about this framework is that it is fully reducible to first-order SLD-Resolution. That’s great because then we don’t need to prove soundness and completeness. Our framework inherits the soundness of SLD-Resolution and its completeness (i.e. the Subsumption Theorem and the Refutation Completeness). In the end, whatever claim we can make about MIL we can also make about SLD-Resolution.
What remains is to define the inductive setting, because we want to do induction. In the Standard Setting for ILP we have background knowledge B and positive and negative examples E+ and E-, and we want to find a logic program hypothesis H, such that:
B ∪ H ⊨ E+
∀ e⁻ ∈ E⁻: B ∪ H ⊭ e⁻
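As a toy instance of these two conditions (the data is my own, purely for illustration): take B to be parent/2 facts, H the usual ancestor/2 program, and check entailment of the examples by querying:

```prolog
% B: background knowledge.
parent(alice, bob).
parent(bob, carol).

% H: a candidate hypothesis.
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

% E+ and E-: positive and negative examples.
pos(ancestor(alice, bob)).
pos(ancestor(alice, carol)).
neg(ancestor(carol, alice)).

% Check B u H |= e+ for every e+, and B u H |/= e- for every e-.
check :-
    forall(pos(E), call(E)),
    forall(neg(E), \+ call(E)).
```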
In MIL, we have the same assumptions but with some restrictions and some extensions:
First, the background knowledge B consists of two components:
B¹ is a set of first-order definite program clauses.
B² is a set of second-order datalog program clauses.
E+ and E- are sets of Horn goals (i.e. “facts”).
So to answer your question about datalog or not: B² is datalog, so there are no function symbols; if we need functions we use a technique called “flattening” from the early days of ILP, which replaces each first-order function term in an argument of a literal with a new variable, plus a new body literal relating that variable to the term’s arguments. For example, p(f(X), ...) becomes p(Y, ...) :- f(X, Y), etc.
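Here is a sketch of flattening on a toy example (the predicate names are mine): the function symbol s/1 in a definition of even numbers is pushed into a bridging predicate s/2, so the learned clauses themselves become function-free:

```prolog
% With function symbols: the clause heads contain s/1 terms.
even(0).
even(s(s(X))) :- even(X).

% Flattened version: s/2 relates a term to its successor term,
% carrying the function symbol; the clauses of even_f/1 are
% themselves function-free (datalog-like).
s(X, s(X)).

even_f(0).
even_f(Y) :- s(X1, Y), s(X0, X1), even_f(X0).
```

Querying even_f(s(s(0))) succeeds and even_f(s(0)) fails, just like the unflattened version; the function symbol now lives only in the background definition of s/2.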
As to SLD(NF) resolution, in principle B¹ is definite. No NaF. In practice most systems implicitly allow it so you’ll find NaF in many MIL background knowledge sets, but the properties of the framework are only nice without it.
Finally, it’s important to note that E+ are allowed to be non-ground. In practice most of the time they’re not, but in some cases they are, and that opens various interesting possibilities. I have never seen or used E- that were not fully ground.
Given B¹, B², E+ and E- as above we now want to learn a hypothesis, H, such that:
B² ⊨ H
B ∪ H ⊨ E+
∀ e⁻ ∈ E⁻: B ∪ H ⊭ e⁻
So basically the same as with the Standard Setting, but with the added assumption that H is entailed by the second-order background knowledge. Or, in other words, H is a specialisation of the second-order background knowledge.
The great thing about this is that if the three conditions above apply, then we can find H by SLD-Refutation of the positive examples in E+ and this is guaranteed by the soundness and completeness of SLD-Resolution. Negative examples are used to discard hypotheses that violate the third condition, which again we can find by SLD-Refutation of E-.
The way all this is implemented is explained in my earlier reply to @jan. Since our framework is SLD-Resolution, we learn H by SLD-Refutation of E+ with B¹ and B². Importantly, we can also allow Resolution of H with itself, which enables the learning of arbitrary recursion, including left-recursion. This is made safe by switching to SLG-Resolution, or avoided by removing the ability of H to resolve with itself. In all cases we also limit the number of times a second-order definite clause can resolve with itself, which is otherwise unbounded.
I hope this addresses your concerns regarding decidability. While it’s true that Turing-complete languages are not decidable, that’s in the general case. Obviously we can run any number of logic programs in a Prolog interpreter, so the undecidability of FOL and the Turing-completeness of Prolog are not a real impediment to doing practical work. And if theoretical completeness is a problem, we can always stipulate that a depth limit is applied, in which case the framework is complete up to a depth of K. I find that this satisfies the most demanding theoreticians. Your mileage may vary!
Please let me know if you have more questions and Happy Easter.
Like I say above, this is a combination of the effect of the argument K which limits the number of clauses in a hypothesis, and the addition or not of the hypothesis itself to the set of clauses fetched during the proof.
It’s not too easy to explain quickly but effectively what happens is that given the limit K (which I believe in this experiment is set to “2”), it is not possible to prove the left-recursive version of ancestor/2 without allowing it to recurse with itself. The tail-recursive version Louise learns is instead learned by resolution with one or more of the positive examples of ancestor/2, which can obviously not lead to infinite recursion (since the example is a ground fact).
Note that K doesn’t limit the depth of recursion, or the number of resolution steps. It only limits the size of the set of clauses that are allowed to be proved. More technically speaking K limits the cardinality of the set of clauses from H that are included in any one Refutation sequence of any one positive example.
This is a softer depth limit and it doesn’t break completeness, so we can still learn a recursive definition of ancestor/2, just without getting stuck in an infinite left-recursion. We can still get stuck in infinite right-recursions though, which is actually a good thing because it means we preserve completeness.
I just tried that; it works like a charm, et voilà, a left-recursive rule, or even worse a doubly recursive rule. It profits from the fact that SWI-Prolog tabling always terminates for Datalog:
?- learn(X).
X = (anc(_A, _B):-parent(_, _B), parent(_A, _)) ;
X = (anc(_A, _B):-anc(_C, _B), anc(_A, _C);parent(_A, _B))
Etc..
But I guess my invention generator is a little bit too random. It does not use higher-order templates; it freely invents disjunctions and conjunctions as the rule body from a freely invented list of variables, and also freely invents the head. But it’s relatively fast. The tabling is controlled via abolish_all_tables/0 and untable/1, like here:
Hey, that looks great! If the MIT license permits it I would like to consider bundling it as a library with my ILP systems; I bet it could help with tracing the learning steps. Thanks for creating it!
No no, my question for this thread is totally different: how to make Datalog or tabling decisions faster! My theme is still:

Although the OP also wanted to find implications, the higher-order template is much easier for that than in the ancestor example, and the problem is not a form of abduction; it’s more about finding rules that are directly deducible. The implications the OP wants to find are not fully added to the knowledge base, they are only validated. But back to the performance: I am testing the negative facts via:
But if I were to use ASP, I could formulate the negative facts as constraints, just as you wrote with (:-)/1. ASP (Answer Set Programming) constraints are declarative rules used to filter candidate answer sets, eliminating those that violate specific conditions. This would give a bottom-up failure: they would be tested during model generation, instead of the top-down failure in SLG resolution. Could be worse, could be better? And since I wouldn’t use head disjunction, the stable model would be more or less the same as with SLG tabling? At least I suspect that.

So the ILP framework has many more instances: it can be used with SLDNF, with SLG, and I suspect with ASP as well. Maybe ASP can even do the rule invention? What else? Last but not least, I don’t know whether ILP has already been used with Soufflé; who knows?
Yeah, the problem is the OP never formally specified what he wants. This is unlike @stassa.p, who specified what they want, namely abduction: find a Hypothesis H on the left-hand side of ⊨:
But the OP wants something else, right? The typical task in mathematics is not empirical induction of a theory, but rather deductive derivation of theorems. So when the OP referred to finding new implications, that would be deduction and not abduction. The Hypothesis H is no longer on the left-hand side of ⊨; it is now rather on the right-hand side of ⊨:
/* OPs problem, H is on the right hand side */
B ⊨ H
I think this is what the OP means when he says “when
you derive a theorem”. And he has Horn clauses as H in
mind. But I might be totally mistaken, but this calls
Thanks! MIT definitely allows it — just keep the copyright notice and license text with your distribution and you’re good to go. Hope it comes in handy for your ILP system!
Now there is a problem of how to solve this with SLD(NF) resolution, while @stassa.p has praised that, unlike an LLM, Prolog can do resolution. There are different types of resolution methods. From a logical, philosophical and mathematical point of view, Prolog’s SLD resolution has a very limited expressiveness. The resolution that Prolog can do was displayed by @stassa.p, basically citing the usual vanilla interpreter solve/1, now extended to prove/7 for various extra purposes:
But the OP wants to derive theorems in the form of implications. Let’s say he is ok with implications using the mirror of (:-)/2:
/* embedded implication syntax */
A -: B
To satisfy the OP’s needs and to reuse the resolution theorem prover of @stassa.p, which was used for evidence E (basically atom queries), and reuse it for theorems H, the OP would need a further resolution rule, since there is now a further logical connective:
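A sketch of what such an extra rule could look like in plain SWI-Prolog. The operator (-:)/2, the predicate prove_hh/1 and the assumption store are all made up here, in the style of λ-Prolog’s implication rule: to prove A -: B, assume A and prove B:

```prolog
:- op(1200, xfx, -:).
:- dynamic assumed/1.

% Vanilla interpreter extended with one rule for embedded
% implication, Hereditary-Harrop style (a sketch, not a full
% implementation; no variable scoping, no depth limit).
prove_hh(true) :- !.
prove_hh((A, B)) :- !,
    prove_hh(A),
    prove_hh(B).
prove_hh((A -: B)) :- !,
    % Assume A for the duration of the proof of B, then retract it.
    setup_call_cleanup(assertz(assumed(A)),
                       prove_hh(B),
                       retract(assumed(A))).
prove_hh(G) :-
    assumed(G).
prove_hh(G) :-
    clause(G, Body),
    prove_hh(Body).
```

For example, after assertz((p(X) :- q(X))), the goal prove_hh((q(a) -: p(a))) succeeds: q(a) is assumed, p(a) backchains into q(a), and the assumption proves it.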
SWI-Prolog does not even natively have such an embedded implication, so extending it by a depth limit would introduce something new to SWI-Prolog. But for example ILP could have such an embedded implication for the purpose of assuming hypotheses during abduction. The situation in general is unlike, for example, λ-Prolog, which has embedded implication natively. The approach there is to leave the terrain of Horn clauses, and subsequently definite programs, and venture into the terrain of Hereditary Harrop Formulas. The definition of a clause extends from having (',')/2 and (;)/2 to also having (-:)/2. Be careful: λ-Prolog does not have the luxury of two operators (:-)/2 and (-:)/2, so (-:)/2 is overloaded to also denote rules:
But we would still use the word “definite” if the clause heads are atomic and the clause body does not have negation as failure.
So the OP’s problem really stretches what SWI-Prolog can do on its logic side by default through SLD(NF). The lambda inferencing has no particular acronym like SLD(NF); one could call it λ-SLD(NF). The slides by Amy Felty call it BACKCHAIN, and BACKCHAIN, when using the same goal selection rule as Prolog, contains SLD as a subset.
One could try whether LLMs can do BACKCHAIN. Enrico Tassi has recently added CHR to λ-Prolog, but an interesting question would also be: does there exist a λ-SLG tabling? Further extensions of λ-Prolog include making it not first-order (FOL) based, but higher-order (HOL) based. One could then imagine that the abduction shapes in @stassa.p’s ILP are natively formulated.
One could argue that it’s not LLMs anymore, but rather LRMs, i.e. large reasoning models. But the transition from LLMs to LRMs happened already a dozen months earlier. And the competition is fierce. DeepSeek Math already appeared in 2024. In 2025 you find that it can show subgoaling, basically BACKCHAIN, and that it is designed for formal theorem proving in Lean 4, Lean basically being a variant of λ-Prolog that has Curry-Howard built-in. I haven’t had hands-on experience yet, but here is a fancy picture:
But the paper also highlights the role of interning of tabled ground terms. I guess they need it for a DCG-ification to pass abductive hypotheses around, while I used a much simpler untable/1 and table/1 combo and simply asserted the guessed rule. Jan W. mentioned interning, since the OP came from stack traces to explanation trees. But maybe only sharing, and not really interning, would be enough? I don’t know how Louise’s meta-interpreter would relate to tabling, in the sense that it could profit from SLG resolution.