Can ILP learn geometry rules with floating point numbers

I’m sorry to jump in and derail the discussion, but since @stassa.p doesn’t seem to mind, I would like to ask a question about ILP.
It seems ILP heavily favors programs that can only juggle atoms (you mentioned Datalog earlier), but what about numbers (integers or reals)?

For example, I would like to use ILP to do geometry.
Let’s say you want to rediscover the rules for parallelism.
You feed the ILP learner examples of pairs of lines, each line defined by 2 points, where some pairs are parallel and others aren’t.
Importantly, coordinates are floats.
Would state-of-the-art ILP systems be able to induce the correct rules for parallelism?

Hi @kwon-young . First I want to apologise to @mathmaster13 for hijacking this thread. The moderators may want to split this whole conversation to a new topic, but that would already be a lot of work to do, so my apologies to them too.

The tl;dr

Now, there is a lot to say here, so I’ll cut to the chase. You ask whether state-of-the-art ILP systems are able to induce the correct rules for parallelism. The answer is yes.

The way to do this is to add to the background knowledge (remember, in ILP we have two kinds of data: examples and background knowledge) suitable predicates to operate on floating point numbers. For example, you could add any of the arithmetic functions that go with is/2, or you can manually code your own arithmetic predicates, whatever best suits your needs.
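For instance, for your parallelism example, the background knowledge might contain something like the following sketch. To be clear, this is my own invention for illustration, not code from any actual experiment; the line/2 and point/2 representation and the tolerance are all assumptions:

```prolog
% Hypothetical background predicate for the parallelism task. A line
% is a line/2 term holding its two endpoints. Two lines are parallel
% when the cross product of their direction vectors is (close to)
% zero. Comparing against a tolerance avoids exact floating-point
% equality, and the cross-product form avoids dividing by zero for
% vertical lines. The tolerance 1.0e-9 is arbitrary and scale
% dependent.
parallel(line(point(X1,Y1),point(X2,Y2))
        ,line(point(X3,Y3),point(X4,Y4))):-
        Cross is (X2 - X1) * (Y4 - Y3) - (Y2 - Y1) * (X4 - X3),
        abs(Cross) < 1.0e-9.
```

With parallel/2, or lower-level predicates like subtraction and multiplication, in the background knowledge, the learner can then construct clauses that call them on the floating point coordinates in the examples.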

The longer example

To give you an example of how to do that from my own work, below is part of the background knowledge I used in a recent project.

The project’s goal was to create an AI agent guiding a robot boat over the surface of a body of water. I wanted a way to constrain the agent’s movements to only changes of direction that take it closer to its destination (a heuristic to avoid trying out unnecessary move actions). The agent is given a map of the area, and its position on the map is represented as a term position(X,Y,Theta), where all three of X, Y and Theta are floating point numbers. (Theta was not needed after all, so you’ll see it’s ignored in the code.) Because coordinates are floating point numbers, predicates that operate on them must be able to manipulate floating point numbers.

Here’s how this works. First we have the predicate moving_forward/2 that checks that the agent’s next move reduces the (Euclidean) distance between the agent’s current position and its destination:

%!      moving_forward(+State1,+State2) is semidet.
%
%       Ensure the agent is moving towards its destination.
%
%       True when the distance of the agent's position in State2 from
%       its destination is less than the distance of the agent's
%       position in State1 from its destination.
%
%       Used as move constraint. Enabled by setting option
%       move_constraint(moving_forward/2) in model configuration.
%
moving_forward(S1,S2):-
        inspect(position,position(X1,Y1,_),S1)
        ,inspect(position,position(X2,Y2,_),S2)
        ,inspect(destination,destination(Xd,Yd,_),S1)
        ,maplist(term_value,[X1,Y1,X2,Y2,Xd,Yd],[X1_v,Y1_v,X2_v,Y2_v,Xd_v,Yd_v])
        ,magnitude(Xd_v/Yd_v,X1_v/Y1_v,M1)
        ,magnitude(Xd_v/Yd_v,X2_v/Y2_v,M2)
        ,M2 < M1.

The distance between a position and the destination is calculated by magnitude/3 as the magnitude of the vector between the two sets of coordinates, as follows:

%!      magnitude(+C1,+C2,-Magnitude) is det.
%
%       Magnitude of a vector between two sets of coordinates.
%
%       Example:
%       ==
%       ?- actions:magnitude(0/0,3/4,M).
%       M = 5.0.
%       ==
%
magnitude(X1/Y1,X2/Y2,M):-
        Dx is X2 - X1
        ,Dy is Y2 - Y1
        ,M is sqrt(Dx^2 + Dy^2).

moving_forward/2 is one of several constraints that can be imposed on the agent’s moves. These are selected before learning, with a configuration option, move_constraint/1:

%!      move_constraint(?Predicate) is nondet.
%
%       Predicate indicator, Symbol/arity of a move constraint.
%
%       Used to choose the constraints that will apply to move actions.
%
%       Known constraints:
%
%       * within_action_budget/2: constrains the length of the list of
%       actions in the state plan term in the state vector to the value
%       of B in list_budget(action,B).
%
%       * in_scan_area/2: rejects moves that end with the boat outside
%       the limits defined in the scan area term in the state vector.
%
%       * passable/2: rejects moves that end in unpassable map terrain
%       as read off the current occlusion grid map.
%
%       * moving_forward/2: ensures the boat keeps moving towards the
%       coordinates of the destination term in the state vector.
%
move_constraint(within_action_budget/2).
move_constraint(in_scan_area/2).
move_constraint(passable/2).
move_constraint(moving_forward/2).

Each of those constraints is checked, before a move is taken, by the predicate move_constraints/2:

%!      move_constraints(+State1,+State2) is semidet.
%
%       Check move constraints in current State1 and next State2.
%
%       State1, State2 are state vectors. State1 is the state vector of
%       the state at the start of a move and State2 is the state vector
%       at the end of a move.
%
%       This predicate goes through all constraints defined in the model
%       configuration option move_constraint/1 and calls each of them.
%
%       Constraints must be defined as dyadic predicates. State1 and
%       State2 will be passed to each such predicate.
%
%       If any of those constraints fails, a move action transitioning
%       from State1 to State2 will not be taken.
%
move_constraints(S1,S2):-
        forall(model_configuration:move_constraint(C/2)
              ,(T =.. [C,S1,S2]
               ,call(T)
               )
              ).

move_constraints/2 is then used to define background predicates that represent the agent’s movement from one point on the map to another, in one of eight compass directions. You can see the calls to move_constraints/2 in the listing of the first-order background knowledge, below:

?- list_mil_problem(move/2).
Positive examples
-----------------
move([destination(x(70.0),y(80.0),theta(0.0)),plan(actions([]),count(0)),scan_area(x_min(0),x_max(249),y_min(46),y_max(136)),position(x(30.0),y(60.0),theta(0.0))],[destination(x(70.0),y(80.0),theta(0.0)),plan(A,B),scan_area(x_min(0),x_max(249),y_min(46),y_max(136)),position(x(70.0),y(80.0),theta(0.0))]).

Negative examples
-----------------

Background knowledge (First Order)
----------------------------------
move_north/2:
move_north(A,B):-modify(position,increment,y,A,C),move_constraints(A,C),inspect(position,position(D,E,F),C),register_action(C,move,move_north,[D,E],B),debug(actions,'Moved North ~w',[position(D,E,F)]).

move_north_east/2:
move_north_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,x,C,D),modify(position,increment,y,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_north_east,[F,G],B),debug(actions,'Moved North East ~w',[position(F,G,H)]).

move_east/2:
move_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,x,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_east,[E,F],B),debug(actions,'Moved East ~w',[position(E,F,G)]).

move_south_east/2:
move_south_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),modify(position,increment,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_south_east,[F,G],B),debug(actions,'Moved South East ~w',[position(F,G,H)]).

move_south/2:
move_south(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_south,[E,F],B),debug(actions,'Moved South ~w',[position(E,F,G)]).

move_south_west/2:
move_south_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),modify(position,decrement,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_south_west,[F,G],B),debug(actions,'Moved South West ~w',[position(F,G,H)]).

move_west/2:
move_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,x,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_west,[E,F],B),debug(actions,'Moved West ~w',[position(E,F,G)]).

move_north_west/2:
move_north_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,y,C,D),modify(position,decrement,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_north_west,[F,G],B),debug(actions,'Moved North West ~w',[position(F,G,H)]).

Background knowledge(Second Order)
----------------------------------
(Chain) ∃.P,Q,R ∀.x,y,z: P(x,y)← Q(x,z),R(z,y)
(Identity) ∃.P,Q ∀.x,y: P(x,y)← Q(x,y)

Metasubstitution constraints
----------------------------
:- dynamic configuration:metarule_constraints/2.
:- multifile configuration:metarule_constraints/2.

configuration:metarule_constraints(m(identity, P, P), fail).
configuration:metarule_constraints(m(chain, P, P, _), fail).
configuration:metarule_constraints(m(chain, _, P, P), fail).

true.

The single, positive training example is the start and end state of a move from a starting position to a destination position. The learned program is a strategy to get from the starting position to the destination. Below I list one example of such a learned program. While it looks like the agent is moving all over the place, if you squint a bit you’ll see that its movement has a bias towards the north-east. That’s because the destination in the training example is to the north-east of the starting position:

?- time( learn(move/2,_Ps) ), print_clauses(_Ps), length(_Ps,N).
% 101,396,427 inferences, 9.250 CPU in 35.975 seconds (26% CPU, 10961776 Lips)
move(A,B):-move_east(A,B).
move(A,B):-move_north(A,B).
move(A,B):-move_north_east(A,B).
move(A,B):-move_north_west(A,B).
move(A,B):-move_south(A,B).
move(A,B):-move_south_east(A,B).
move(A,B):-move_east(A,C),move(C,B).
move(A,B):-move_east(A,C),move_north(C,B).
move(A,B):-move_east(A,C),move_north_east(C,B).
move(A,B):-move_east(A,C),move_south(C,B).
move(A,B):-move_east(A,C),move_south_east(C,B).
move(A,B):-move_north(A,C),move(C,B).
move(A,B):-move_north(A,C),move_east(C,B).
move(A,B):-move_north(A,C),move_north_east(C,B).
move(A,B):-move_north(A,C),move_west(C,B).
move(A,B):-move_north_east(A,C),move(C,B).
move(A,B):-move_north_east(A,C),move_east(C,B).
move(A,B):-move_north_east(A,C),move_north(C,B).
move(A,B):-move_north_east(A,C),move_north_west(C,B).
move(A,B):-move_north_east(A,C),move_south_east(C,B).
move(A,B):-move_north_west(A,C),move(C,B).
move(A,B):-move_north_west(A,C),move_north_east(C,B).
move(A,B):-move_south(A,C),move(C,B).
move(A,B):-move_south(A,C),move_east(C,B).
move(A,B):-move_south_east(A,C),move(C,B).
move(A,B):-move_south_east(A,C),move_east(C,B).
move(A,B):-move_south_east(A,C),move_north_east(C,B).
move(A,B):-move_west(A,C),move_north(C,B).
N = 28.

So the way this all comes together is that, during learning, the agent tries different moves and puts together a strategy to arrive at its goal, but it’s constrained to only try moves that reduce the distance to its goal. That’s a classic distance-based navigation heuristic.

I hope this gives you an idea of the kind of background knowledge you would have to define if you wanted to deal with floating point numbers, particularly in the context of geometry.

All of the output above is from Louise, the MIL system I used in that project, but the same approach would apply with any recent ILP system; e.g., I believe you could do the same with Popper.

Further discussion

Now, I would like to add something, because all this might look like too much work. And it is! Especially if all you’re trying to do is to re-derive a well-known result, like the rules of parallelism.

Unfortunately most of ILP presupposes a great deal of knowledge engineering that goes into constructing a suitable background knowledge base. In the worst case it looks like we are hand-coding the solution and giving the learning system exactly the background knowledge it needs to solve a problem, no more, and no less. That’s a fair criticism and a lot of work on ILP is subject to it.

The challenge then is to find ways to learn without having to manually define all this background knowledge. In my latest work, which I linked above, I show how to cast the ILP learning problem as a grammar learning problem, which basically means that we are trying to learn an automaton of some sort.

So, nowadays, if I was trying to learn the rules for parallelism, I would not think of geometry and floating point arithmetic, as such, but instead I would try to answer the question:

What kind of automaton accepts or generates only parallel lines?

Then I would try to come up with examples of the output of such an automaton and give them to my system to re-construct that automaton from the examples. In that case, the First-Order background knowledge would only need the pre-terminals in the target language; essentially those would be the states of the target automaton. As for the Second-Order background knowledge, it now becomes a Second-Order Definite Normal Form, which is exactly what it sounds like: a normal form expressed as a set of second-order definite (datalog) clauses. So for instance we could use a second-order normal form for context-free grammars, like a second-order version of Chomsky Normal Form; or a second-order normal form for context-sensitive grammars, like a second-order version of Kuroda Normal Form.

This is a more advanced method that is still under construction, so to speak, but I am much more excited about it than the traditional, manual way of hand-crafting background knowledge I show above. However, if this is the first time you’ve worked with ILP I think you’d find the traditional way easier to reason about.

@stassa.p Thank you very much for your time and answer to these beginner questions :slight_smile:

This is kind of amazing that you mention grammars, since my own interest and motivation for my first question is precisely this: graphical grammars.
To be more precise, here is a blog post I wrote that defines what I mean by this: Triangle · Kwon-Young Choi

Basically, I want to write (or induce automatically) pure relational graphical grammars that can relate high level semantic structures (let’s say a triangle) and a list of graphical primitives (a list of segments).

So, here is my poor attempt at reformulating this problem as you do:

An ILP setup has positive/negative examples + background knowledge:

  • Positive examples would be a set of 3 segments forming a true triangle
  • Negative examples would be any set of any number of segments which do not form a triangle
  • Background knowledge would be spatial equality of 2 points
    • in a purely symbolic geometry setting, we could just reuse prolog unification point(A, B) = point(C, D)
    • in a setting with real coordinates, we could use Euclidean distance with a threshold (it would be amazing if the threshold could be learnt!)

Then your ILP system (Louise or Popper) could learn the chain of equalities that would produce a triangle?
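To make the background knowledge bullets concrete, here is a sketch of the kind of point-equality predicates I have in mind (the predicate names are made up, and the threshold is hard-coded as a placeholder):

```prolog
% Symbolic setting: two points are equal when they unify.
points_equal(point(X,Y), point(X,Y)).

% Real-coordinate setting: two points are "equal" when their
% Euclidean distance is below a threshold. Ideally the 0.001
% threshold would itself be learnt rather than fixed here.
points_close(point(X1,Y1), point(X2,Y2)) :-
        Dx is X2 - X1,
        Dy is Y2 - Y1,
        sqrt(Dx*Dx + Dy*Dy) < 0.001.
```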

For more context on what I am interested in:

Basically, my interest stems from trying to write graphical grammars for Optical Music Recognition (OMR).
This is the field of trying to extract musical information from images of music scores.

In contrast to OCR (optical character recognition), which is now dominated by deep learning models, OMR has a very strong and complicated graphical grammar.
I believe (this is my own opinion) that formalizing these graphical grammar rules in code instead of learning them purely through deep/machine learning can be immensely more efficient and accurate.
However, I can assure you that writing these rules by hand is an absolute nightmare (and I love it ^^).
That’s why I am searching for alternatives, and ILP is one.

Hi @kwon-young, OMR sounds interesting and yes, maybe that’s a field that ILP could make some progress beyond what’s possible with deep learning. Obviously I’d be really happy if that was the case. Maybe we should take that conversation private to avoid polluting the thread, on the other hand I’d like to hear what the rest of the community has to say.

@EricGT what do you think? I would like to dig into this problem that @kwon-young is interested in, of learning OMR rules with ILP, specifically with systems that I’ve created with SWI-Prolog. Is this the right forum for this kind of discussion and is it something we should share with the rest of the community as a separate thread? Or should we take the discussion to private messages?

@kwon-young, in the meantime, in my paper linked above I show examples of learning L-Systems with the new, self-supervised MIL system Poker (careful, that’s not Popper! I know, it’s confusing). L-Systems are a grammar formalism used to describe shapes with branching and self-similarity, like fractals and plants. You can find more information on wikipedia: L-system - Wikipedia.

L-system grammars are used as generators, to generate strings of movement and drawing instructions for a “robot”. This used to be a physical robot drawing on a physical canvas, but now it’s usually a simulated robot drawing on a virtual canvas. Complex shapes can be drawn this way; for example, below are the rules for the Hilbert Curve fractal, as learned by Poker for one of the experiments in my paper:

s(A,B,C):-y(B,D),minus(A,E),x(E,F),f(F,G),plus(G,H),y(H,I),f(I,J),y(J,K),plus(K,L),f(L,M),x(M,N),minus(N,O),s(O,D,C).
s(A,B,C):-x(B,D),plus(A,E),y(E,F),f(F,G),minus(G,H),x(H,I),f(I,J),x(J,K),minus(K,L),f(L,M),y(M,N),plus(N,O),s(O,D,C).
s(A,B,C):-plus(B,D),plus(A,E),s(E,D,C).
s(A,B,C):-minus(B,D),minus(A,E),s(E,D,C).
s(A,B,C):-f(B,D),f(A,E),s(E,D,C).
s(A,B,B):-empty(A,B).

It’s not immediately easy to see but those are grammar rules. They translate to the following DCG which was used to generate training and testing examples for the experiments in the paper:

hilbert_curve([+,y,f,-,x,f,x,-,f,y,+|Ss]) --> x, hilbert_curve(Ss).
hilbert_curve([-,x,f,+,y,f,y,+,f,x,-|Ss]) --> y, hilbert_curve(Ss).
hilbert_curve([+|Ss])--> plus, hilbert_curve(Ss).
hilbert_curve([-|Ss])--> minus, hilbert_curve(Ss).
hilbert_curve([f|Ss])--> f, hilbert_curve(Ss).
hilbert_curve([]) --> [].

And these are the definitions of the symbols plus, minus, f, x and y, that are also given to the learning system as background knowledge:

plus --> [+].
minus --> [-].
x --> [x].
y --> [y].
f --> [f].

So those are just the pre-terminals of the grammar, as I say above. In the example you give of defining a triangle DCG in your blog post, I think those pre-terminals, and therefore the background knowledge, could correspond to triangle vertices. Then the system would have to learn the concept of a segment from the ground up. Alternatively you could give the system seg/2 terms, but I find that less interesting because if you do that you can also directly define triangles and whatever shape you want. And where’s the fun in that? :slight_smile:

There are more serious reasons than fun also, but let’s first decide how we can continue this conversation. I’ll wait to hear from @EricGT . I’ve polluted this thread too much already!

What does Louise do to refine its Hypothesis Space?

Genetic Algorithms (GAs) in robotics are optimization techniques
inspired by natural evolution, used to solve complex problems like
path planning, trajectory optimization, and robotic design.

The analogy is striking:

  • Genetic Drift that performs the Adaptation:
    GA: Multimembered Binary Crossover of Genetic Code
    ILP: Punctual Refinement based on Conflicting Facts?

  • Genotypes that perform the Adaptation:
    GA: Some Genetic Code that leads to a Phenotype
    ILP: Intensional Rules that lead to Extensional Facts

  • Environment to Adapt to:
    GA: Sometimes Physical World virtually simulated (sic!)
    ILP: Background Knowledge

  • Selection Function to Adapt to:
    GA: Some Fitness Function
    ILP: Positive and Negative Examples

But I didn’t find a paper yet where ILP would do some crossover. Maybe a SWI-Prolog prototype? Håkan Kjellerstrand’s latest opus was running a GA in Picat for hundreds of hours, to discover some physical constant formulas, using his symbolic regression program, a method which had a revival in 2022.

I am not sure I understand the question. Meta-Interpretive Learning (MIL) that Louise uses is not an optimisation-based algorithm. As I said before it is a second-order form of SLD-Resolution, so it really helps to think of MIL as you think of Prolog proofs, except that some of the variables bound during the proof are predicate symbols; or “functors” in Prolog parlance.

So what does Prolog do to refine its hypothesis space? Nothing, because it does not have anything that can be called a “hypothesis space”. What it has is a “proof space”, which is in practice a structure known as an SLD-tree.

An SLD-Tree is a tree where the root node is the goal that begins the proof, each intermediary node is a new goal derived during the proof by resolution of a program clause and a goal earlier in the tree, and each branch is a program clause from our logic program.

MIL is exactly the same, except a) program clauses also include second-order definite datalog clauses, with second-order variables existentially quantified over the set of predicate symbols; and b) the logic program is not static but includes new clauses derived during the proof, i.e. the “hypothesis” we are learning. In practice the second-order clauses are reduced to first-order clauses with ordinary universally quantified first-order variables and even the clauses of the hypothesis can be left out of the proof. So again you only need to think about every-day first-order SLD-Resolution.

Maybe you can re-phrase your question as “What does Prolog do to refine an SLD-Tree?”.

And the answer is: unification. Ultimately, that’s the “refinement” in SLD-Resolution (and first-order Resolution in general).

Now, if you want, you can look at unification as a reduction of uncertainty, in the sense that when a variable is unified to another term our uncertainty about its true value is reduced. In that sense Resolution with unification is an optimization procedure, specifically one that minimizes entropy (as a measure of uncertainty). I like to think of it that way, but only in the most general terms, i.e. don’t expect me to write down the formula to calculate the entropy of a logic program or its reduction during unification, or anything like that. It’s just an intuition.
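To make that intuition concrete, here is a toy sketch, emphatically not Louise’s actual code, of a proof where binding a second-order variable to a predicate symbol by unification is the “refinement” (all names, and the m/4 encoding, are illustrative):

```prolog
% Toy sketch, not Louise's implementation: a mini meta-interpreter
% where the second-order variables P, Q, R of a metarule are bound to
% predicate symbols by plain unification during the proof.

% Background knowledge.
bk(parent(alice, bob)).
bk(parent(bob, carol)).

% Predicate symbols that second-order variables may be bound to.
symbol(parent).

% prove(+Goal, -Metasub): prove Goal using one instance of the Chain
% metarule, encoded as m(chain,P,Q,R) for P(x,y) :- Q(x,z), R(z,y).
prove(G, m(chain, P, Q, R)) :-
        G =.. [P, X, Y],
        symbol(Q),
        symbol(R),
        G1 =.. [Q, X, Z], bk(G1),
        G2 =.. [R, Z, Y], bk(G2).

% ?- prove(grandparent(alice, carol), M).
% M = m(chain, grandparent, parent, parent).
% The ground metasubstitution corresponds to the learned clause:
% grandparent(X,Y) :- parent(X,Z), parent(Z,Y).
```

Every step of the proof is ordinary SLD-Resolution; the only difference from everyday Prolog is that some of the terms bound by unification stand for predicate symbols.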

Genetic Algorithms are cool but they suffer from the same problems with local optima as gradient-based optimisation (and IIRC gradient optimisation is often used to train GAs). We don’t have that kind of problem with Resolution.

Ok, I have found your Poker implementation in the vanilla repo.
It doesn’t really have any kind of installation instructions, but I suppose it only relies on SWI-Prolog?
Is there a specific version needed?

If that is the case, I’ll try to write an example to learn my triangle DCG rule, and probably ask questions back when I am blocked :slight_smile:

The thing is that I know that geometry problems like this have already been extensively studied in the literature, but more under the angle of theorem proving, I believe.
But I never really did an extensive literature review of the subject, so I don’t really know what I am talking about.
Is there a link between theorem proving and ILP ?

Oh god, the documentation for Vanilla and Poker is woefully lacking. I’m sorry about that.

No, you don’t need anything but the latest version of SWI-Prolog. I recommend the latest development version, which is more likely to be close to the one I used while developing Poker.

You will definitely need some help from me to use Poker. Many apologies, it is still new and I’m still working on it. In the meantime you can look at the documentation of the earlier version of Louise GitHub - stassa/louise: Polynomial-time Meta-Interpretive Learning · GitHub that shares much functionality with Vanilla and Poker, particularly in the way it uses “experiment files” and configuration options, and also the logging facilities. I really need to just copy some of the stuff over from that repo to Vanilla.

Edit: I just had a look at the Louise documentation to remind myself of what’s in there and there is a section on Vanilla, including how to control recursion. This is mostly the same for the stand-alone version of Vanilla which was really split off Louise at some point. So that will help also. I’ll see about updating the documentation in the Vanilla repo in the next few days if you are going to be working with it, but please just get in touch if you need anything.

For Meta-Interpretive Learning, certainly! Because it’s really a second-order form of SLD-Resolution, which is a proof procedure. But more broadly, Inductive Logic Programming is all about going from deductive inference to inductive inference, i.e. from reasoning to learning. And what I’ve learned is that there is no difference between the two: induction is deduction raised to a higher order of logic; reasoning and learning are one. I’ve even published that and the reviewers didn’t throw it out as just a poetic metaphor so I think it’s good.

Edit: most other ILP approaches also somehow use proof to learn, btw. For example, Popper casts the learning problem as a satisfiability problem and hands it over to a SAT Solver. Inverse Resolution is what it says on the tin, an inversion of Resolution theorem-proving. And so on. That’s the main idea behind ILP, to use automated theorem proving for learning.

Gradient-based optimisation often refers to differentiable problems. GAs need not have a differentiable problem: just like in nature, where genetic recombination is seen as rather random, GAs use similar recombination operators. It’s more a random swarm walk when the GA is multi-membered; sometimes an utterly useless walk that circles around and doesn’t even come close to local optima.

Producing species that do not survive the next summer. On the other hand, does Louise find a global optimum? I was reading the TOIL paper. It seems that your Hypothesis Space refinement is a kind of enumeration of the Hypothesis Space:

https://arxiv.org/pdf/2106.07464

It just says “Select M ∈ MM”, and returns an M instance if SLD-Resolution derives the empty clause from the negation. So you use resolution with refutation as the formalization, i.e. the reductio ad absurdum:

G, ~A |- f

Classical logic double negation elimination gives you:

G |- A

So if you run it and show the first M instance found, then the order in which you do your “Select M ∈ MM”, and the order in which you determine instances, is your ranking. If the enumeration defines your ranking from highest to lowest, then your algorithm finds a global maximum, by a kind of vacuous argument. But what if the end user wants a more complex preference structure? Like supervised learning (SL), but with a cost function. So you would not only learn to imitate a ground truth and, since you use rules, get a big amount of generalization; instead you would have reinforcement learning (RL) that minimizes some cost. RL is probably more common for robotics than SL alone. But what do you do with the generalization that rules have? Do you use a small set for initial learning, and then validate against a larger set?

The formalisation of MIL in the TOIL paper is out of date; I wouldn’t really rely on it to understand MIL. I mean, you can see the refutation proof right there in Algorithm 1, as you point out, but the paper frames MIL as “metarule specialisation”, which is missing the forest for the trees: the specialisation in the title is the result of unification during resolution, but that point kept escaping me for most of my PhD. When the paper was written I was in the middle of my PhD and still struggling to understand how and why MIL works. Both become clear when we re-define MIL as second-order SLD-Resolution, as I presented it above.

There is some preliminary work on that in the following paper:

A more in-depth treatment would need a longer, journal article that I haven’t had the resources to write. Sorry about that.

It’s just Modus Ponens in the end, or if you want, you can also call it a Resolution step. If you have a hypothesis K = (H :- B), and if you can prove the body B of K via refutation, as in your Algorithm 1:

G |- B

Then you have also proved the head H of K:

           ------------- (Axiom)
G |- B      K |- B -> H
------------------------ (Modus Ponens)
         G, K |- H

But in algorithm 1 you have chosen the rule instance K such that H is already among the examples E. Basically, since you already choose among the examples, you do a small backward chaining step from the examples. But I wonder whether somebody has already noticed that Higher Order is not really necessary. If you use semantic triples, you could model a rule pattern:

P(X,Y) :- Q(X,Z), R(Z,Y)

As follows:

rdf(X,P,Y) :- rdf(X,Q,Z), rdf(Z,R,Y)

Just use the Predicate (P) from Subject-Predicate-Object (SPO). It would only become more complicated if the arity is > 2, but it could still be done for arity = 1 as well:

P(X) :- Q(X)

Could become:

rdf(X,P,true) :- rdf(X,Q,true)

This is what the OP used in the parent thread of this thread.
He also allowed the value false, I guess the idea is that
rdf(X,Q,false) would model ~Q(X).

The parent thread also contains declarative and procedural views of @stassa.p’s Louise ILP system. Pity these posts didn’t land in this new thread here; most likely they are a beginning to understanding how the Hypothesis Space is searched:

What I did with the logical explanation, that a hypothesis K = (H :- B) can derive H from B via Modus Ponens, and how it is procedurally shown in your algorithm 1, is the classical switch between these two views of Prolog:

  • Declarative View (What is true): Focuses on the logical meaning
    (the “model”) of the code. A rule H :- B means " H is true if B is true".
    It ignores the order of rules and goals.

  • Procedural View (How it works): Focuses on the sequence of operations.
    The rule H :- B means “To solve H, first solve B”. It depends heavily on the
    ordering of clauses and goals (top-to-bottom, left-to-right).

Your algorithm 1 is just a capsule of “To solve H, first solve B”. I think that practically ILP also has a procedural aspect, not only an appraisal of a declarative view, and it somehow determines which hypothesis you list first.

By asking HOW the Hypothesis Space is refined, I didn’t want to know WHAT ILP does, which was already shown in the parent thread, but HOW it finds Hypotheses, in particular in what order, and whether the end user can have preferences. I suspect your higher order input to Louise changes the found Hypothesis if you reorder the higher order input, right? I didn’t try it yet, but I guess so, right? What search order does your “2nd order SLD / MIL” implement when it searches the Hypothesis Space? It seems the naive view of algorithm 1 is anyway not correct, since there are algorithms 2, 3, 4, .. Does Louise extend the end-user given higher order templates anyway, through heuristics?

BTW: A possible answer could be input order of examples followed by input order of rule patterns. But I am not 100% sure whether Louise, for example, implements that. Algorithm 1 looks a little bit like that, but practically it could also be the other way around: rule patterns followed by examples. Traditionally Prolog itself works with input order, the clauses of a predicate are executed from top to bottom, and even indexing doesn’t change that. SLG resolution would change the order a little bit, through worker queues. But coming back to Louise: maybe I could influence the result of Louise by also shuffling the examples?

Well, a PhD climbing down from the ivory tower and trying
to sell ILP like it’s Nescafé instant powder could be a noble
cause for a community project like SWI-Prolog.

The original Plotkin dissertation, already posted in this SWI-Prolog
Discourse forum some time ago in 2020, is a rather off-putting
read for the casual Prolog user, I guess?

On the other hand, the academic freedom of a PhD could also
yield real gold nuggets. Now I have a suspicion, since the
TOIL paper mentions polynomial complexity.

So the higher order patterns are bounded; one might have:

Q(X,Y) :- P(X,Z), R(Z,Y).

But not necessarily:

Q(X,Z,Y) :- P(X,Z), R(Z,Y).

So if such building blocks are applied, maybe it’s indeed
the case that the Hypothesis Space stays relatively tame,
concerning the complexity of evaluation.

It’s possibly similar to trying to stay in 2-SAT and not go into
3-SAT. Does that mean we don’t have an arbitrary factor |V|
like in this formula here?

Disclaimer: I know I am comparing apples with oranges,
since the above is for Datalog. But I guess “2nd order SLD / MIL”
doesn’t want to see itself restricted to Datalog?

Modus Ponens is a distraction. As noted in the other thread, the correct framework for MIL is second-order SLD-Resolution. It’s possible to see this in the implementation of MIL in the Vanilla engine, as a Prolog meta-interpreter. I can also share some anecdotal evidence for that: after explaining MIL as Second-Order SLD-Resolution to James Trewern during my post-doc at the University of Surrey, he was able to implement a new MIL engine in Rust, which is included in the paper I discussed earlier, where he is a co-author. I had no part in building the Rust engine and gave no more directions than the framing of MIL as a kind of second-order Prolog. Mr Trewern then read about the WAM and implemented a version that natively accepts second-order definite clauses. So we went from my theory to his implementation with some success, which should only work if the theory is good to begin with (and if the implementation is tackled by a competent coder, like Mr Trewern).

A few points.

Yes, this is more or less how Vanilla implements resolution with second-order definite clauses except of course the arity of the first-order literals is not restricted to 3 (because the arity of second-order clauses is also unrestricted). Additionally, the symbol is not “rdf” but an arbitrary symbol without special meaning. Finally, we also treat the clauses of the first-order background knowledge and the training examples, both positive and negative, in the same way so everything is end-to-end unifiable, so to speak. We call this process “encapsulation”.
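To make “encapsulation” concrete, here is a toy sketch (simplified; the actual representation in Louise differs in its details): every atom, first- or second-order, is wrapped in a common symbol, say m/N, so that second-order variables become ordinary first-order variables that plain Prolog unification can bind.

```prolog
% Chain metarule, encapsulated: the second-order variables P, Q, R are
% now ordinary Prolog variables in the first argument of m/3 atoms.
% (metarule/3 is a name I made up for this sketch.)
metarule(chain, m(Q,X,Y), [m(P,X,Z), m(R,Z,Y)]).

% Background knowledge and examples are encapsulated the same way:
m(parent, tom, bob).
m(parent, bob, ann).

% Resolving the example m(grandparent, tom, ann) with the chain
% metarule binds Q = grandparent, P = parent, R = parent by ordinary
% first-order unification, so everything is "end-to-end unifiable".
```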

As I pointed out in the other thread, MIL is fully reducible to first-order SLD-Resolution and the way this is done is this process of “encapsulation”. Encapsulation was first introduced in the original Metagol paper, in the Proceedings Abstracts of the Twenty-Third International Joint Conference on Artificial Intelligence. It is described in more detail in section 4 (Implementation) of the original Louise paper, “Top program construction and reduction for polynomial time Meta-Interpretive learning” (Machine Learning, Springer).

Armed with the understanding of MIL as a form of SLD-Resolution it becomes clear that MIL does not search any Hypothesis Space: because SLD-Resolution doesn’t search any Hypothesis Space.

Now that we know there is no Hypothesis Space being searched we can also know with certainty that the order of such a Space doesn’t matter, and it is not changed by changing the order of clauses in the Background Knowledge, either first- or second-order, nor the order of the training examples, either positive or negative. This should be easy to check manually in Louise but all the experiments in the published literature on Louise select examples at random without any specific ordering.

Earlier versions of MIL, in particular Metagol (GitHub - metagol/metagol: an inductive logic programming system), were indeed implemented so as to search a Hypothesis Space, and the order of examples and background knowledge clauses did matter, but Metagol will, on backtracking, return every correct hypothesis, just in different orders if you change which examples you show first.

The search of a Hypothesis Space in Metagol was quite unnecessary and inefficient, and the innovation in Louise was to remove it, thus improving performance, and, eventually, understanding of the procedure as a form of Resolution. This was the subject of my PhD thesis.

Speaking of unnecessary and inefficient things, your turn of phrase above is borderline offensive and I expect that you will tone it down in the remainder of our communication. I do not think I have ever been anything but respectful towards you in our exchanges so far.

Like I say, no hypothesis space anywhere to be found now that we know MIL is second-order SLD-Resolution. Indeed, the complexity of MIL is that of executing a Prolog program which is to say, it is dominated by the complexity of the Prolog program. In MIL that is the program we are trying to learn. But the learning procedure itself is as efficient as ordinary Prolog; because it is, in practice, ordinary Prolog.

But, yes, in the TOIL paper we prove this based on the constant number of literals in second-order definite clauses. For the record, the TOIL paper went through several rounds of review in the first of which it was rejected with encouragement to resubmit. One of the reviewers in particular made invaluable contributions to the paper and helped me tremendously to hammer out the proofs of complexity in the Framework section. Which is to say I’m fairly confident that we got it right in the end.

It depends. Currently, only the second-order background knowledge (i.e. the “metarules”) is restricted to Datalog. The first-order background knowledge is allowed to be an arbitrary Prolog program, including negation as failure, impure Prolog, cuts, and whatever else the user wishes.

Since hypotheses are instances of the second-order clauses that are Datalog, hypotheses are also Datalog, but their body literals are calls to ordinary Prolog predicates, so the end result is Prolog, not Datalog.
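For example (a toy illustration of mine): a hypothesis clause is Datalog-shaped, but its body literals can call arbitrary Prolog defined in the first-order background knowledge, arithmetic included.

```prolog
% First-order BK: ordinary Prolog, with is/2 and all.
double(X, Y) :- Y is X * 2.

% A hypothesis instantiated from the (Datalog) chain metarule: the
% clause itself has no function symbols, but its body literals are
% calls to full Prolog.
quadruple(X, Y) :- double(X, Z), double(Z, Y).

% ?- quadruple(3, Y).
% Y = 12.
```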

Finally, as I noted in the earlier thread, even though second-order clauses do not have variables quantified over function symbols, we can use a “flattening” approach that has the same effect. The Hilbert Curve program near the start of this topic shows an example of that.

Well, you are free to disagree with terminology. But it doesn’t help
understanding what Louise does. Now I think that is the cause of some
terminological confusion, and it is only an artificial separation from Metagol
that has nothing to do with the real matter of Resolution. For example,
the Rust implementation of Prolog^2 does also use subgoal search,
in its realization of Resolution, by using a WAM variant. Prolog^2
seems to be the Rust implementation by James Trewern:

Resolution

Resolution in Prolog^2 is handled by a ‘Proof Stack’ which stores the
environment for each goal, with a pointer variable to the current environment.
When an environment is created the only thing stored is a pointer to the goal
atom on the heap. When the goal is tried, if a match is made to the head of a
clause, 4 things are added. Firstly the list of clause indexes left to be matched
against for that goal. Secondly, the bindings created by matching to a clause.
Thirdly whether or not a new clause was introduced when matching, and if this
new clause has an invented predicate. Finally, the environment stores the number
of child goals that are created. These child goal environments are inserted at the
index after the current pointer in the proof stack. With this information, if any
goal fails, the effects of attempting a goal can be undone whilst backtracking. At
each successful step, the solver increments the pointer; once the pointer exceeds
the length of the proof stack then a solution is found; if whilst backtracking the
pointer would be decremented before the first goal then a query has resolved to
false. This approach is an adaptation of the one described for the WAM.

Meta-Interpretive learning as Second Order Resolution
https://hmlr-lab.github.io/pdfs/Second_Order_SLD_ILP2024.pdf

It’s not about terminology, but about data structures.

In MIL when we speak of a Hypothesis Space we mean the set of all logic programs that can be constructed by instantiating the second-order definite clauses in the second-order background knowledge with symbols and constants from the first-order background knowledge, the positive examples, and any invented predicate symbols. This is the space that is searched by Metagol, with an iterative deepening search over the cardinality of logic programs (i.e. as the search goes deeper, it finds programs with more clauses).

Louise on the other hand does not search this space and does not perform an iterative deepening search. What it searches instead is an SLD-Tree, i.e. a tree where the root node is a positive example, each intermediate node is a new goal derived during the proof, and each edge is a clause from the first- or second-order background knowledge, or a clause from the hypothesis formed so far. This is evident in the implementation of Louise and also of Vanilla, which I can see you are already looking at.

It should be clear that an SLD-tree and the set of all logic programs are quite distinct objects. There are also empirical results in the first paper on Louise and Top Program Construction that show that Metagol can’t learn logic programs with more than five clauses (it keeps searching for the sixth clause for a week until the experiment is stopped) while Louise can learn programs with thousands of clauses. That is what we should expect to see when comparing a search of an SLD-tree with a search of a Hypothesis Space, i.e. searching a Hypothesis Space is (much) more expensive.

This is the older version of Louise, before the second-order SLD-Resolution reformulation but I have repeated the same experiments with the Vanilla version of Louise and it works the same.

So, yes, if you want to understand the differences between Metagol and Louise you should consider the difference between searching an SLD-tree and searching the space of all logic programs.

EDIT: to be fair, you will not find all this in the first two papers on Top Program Construction and TOIL that I see you have been reading, since those were written before I understood those subtleties myself. The Second-Order SLD-Resolution paper has a more up-to-date framework. For more details I’m afraid you might have to look at my PhD thesis. Please let me know if you wish to submit yourself to such a cruel torture.

EDIT 2: I should say I appreciate your skepticism of the claim and other academics have also expressed similar reservations. I think the main obstacle is that we are all inured to the idea that, to learn a program, one has to search the space of all programs. It just so happens that by raising Resolution to the second order of logic, that is no longer necessary. Another way to see this is that we don’t need to search a set of programs because we already have a program: the higher-order definite program formed by the first- and second-order background knowledge. To construct a new first-order definite program (i.e. the target hypothesis) all we have to do is to specialise this higher-order program by unification, during Resolution.
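A toy sketch of what “specialise by unification” means (my own code, not Louise’s implementation): proving one positive example against an encapsulated metarule makes the first-order clause, i.e. the metasubstitution, simply drop out of the proof.

```prolog
% Encapsulated toy background knowledge.
m(parent, tom, bob).
m(parent, bob, ann).

% Chain metarule with its list of second-order variables.
metarule(chain, [Q,P,R], m(Q,X,Y), [m(P,X,Z), m(R,Z,Y)]).

% Prove an example with a metarule; the substitution for the
% second-order variables is constructed by unification alone,
% with no enumeration of candidate programs.
metasub(Example, Name-Sub) :-
    metarule(Name, Sub, Example, Body),
    maplist(call, Body).

% ?- metasub(m(grandparent, tom, ann), Sub).
% Sub = chain-[grandparent, parent, parent].
```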

But that’s not a get-out-of-jail-free card! We can’t construct the right first-order program unless we have the right higher-order program. When we know the form of the first-order program, that’s easy, but what happens when we don’t? In some respects, this (i.e. selecting “the right” second-order program) is a much bigger, thornier problem than efficiently searching a large program space. I believe this was the main reason that Popper (the most popular ILP system currently) has stuck with a search-based approach and mostly did away with a higher-order program. I’ve stuck with the proof-based approach and I’m instead trying to find ways to select the right higher-order program.

But, after careful examination of the literature and the available implementations, there should remain no doubt about the way MIL works, at least with the Top-Program Construction approach in Louise and Vanilla. It is proof-based, not search-based.

I guess the Hypothesis Space is explored through 2nd order SLD.

Since it’s a 2nd order SLD-tree, and not simply an SLD-tree, it allows
logic programs to be plugged in. In that the 2nd order predicate
variables (P, Q, etc.) get instantiated, you create a logic
program on the fly. The challenge is to have a coherent predicate
variable (P, Q, etc.) instantiation, so that a hypothesis H doesn’t
blur into multiple hypotheses H1, …, Hn. The meta-interpreter below
is literally threading a logic program Prog, that comes into being,
through its input and output parameters. It’s the main reference for
Prolog^2 by James Trewern and possibly the mental model that was
carried from meta-interpreter to WAM realization; it might not be
the main reference for Vanilla by Stassa Patsantzis. That’s the
beauty of Computer Science: a plurality of viewpoints combined with
a plurality of blind spots.

The Hypothesis blurring is possibly not so much a problem when
the Hypothesis is only used in one place. But if it has multiple
invocations, and maybe some ordinary Prolog variable instantiations
as well, not keeping track of this commonality would give a quite
wrong explanation. Also, the meta-interpreter below is possibly not
extremely fit if a 2nd order instantiation happens along a disjunction
and not along a conjunction. A safe choice is to use hypothesis 2nd
order instantiation at the top, but this probably gives only a limited
form of hypothesis. What you can do at the top, so as to handle
disjunction correctly, is to kind of already provision certain
Hypothesis templates:

Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited

https://link.springer.com/article/10.1007/s10994-014-5471-y
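For readers who don’t have the paper open: from memory, the meta-interpreter in question is along these lines (a simplified reconstruction of mine, not the paper’s exact listing). Note how the program Prog that “comes into being” is threaded through as an accumulator of metasubstitutions.

```prolog
% Toy background knowledge and a single (chain) metarule.
prim(parent/2).
parent(tom, bob).
parent(bob, ann).

metarule(chain, sub(chain, [Q,P,R]), m(Q,X,Y), [m(P,X,Z), m(R,Z,Y)]).

% prove(Goals, Prog0, Prog): Prog0 and Prog thread the hypothesis.
prove([], Prog, Prog).
prove([m(P,X,Y)|Goals], Prog0, Prog) :-
    prim(P/2),
    G =.. [P,X,Y],
    call(G),                          % goal solved by the BK
    prove(Goals, Prog0, Prog).
prove([Goal|Goals], Prog0, Prog) :-
    metarule(_Name, Sub, Goal, Body),
    save(Sub, Prog0, Prog1),          % record the metasubstitution
    prove(Body, Prog1, Prog2),
    prove(Goals, Prog2, Prog).

save(Sub, Prog, Prog) :- member(Sub, Prog), !.
save(Sub, Prog, [Sub|Prog]).

% First solution:
% ?- prove([m(grandparent, tom, ann)], [], Prog).
% Prog = [sub(chain, [grandparent, parent, parent])].
```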

Remember that we reduce everything to ordinary Prolog by encapsulation. Prolog doesn’t create a logic program on the fly and so neither does MIL.

Yes, this is the earlier formulation of MIL in Metagol. It is set up as a search of the space of programs. Top-Program Construction as implemented in Vanilla avoids that.

Prolog^2 is not ordinary Prolog. It has its own WAM.
It also has its own notation for predicate variables
(P, Q, etc.), the 2nd order variables.

I think there is an abyss between Prolog^2 by James Trewern
and Vanilla by Stassa Patsantzis, although they are supposed
to do more or less the same thing. They are both advertised

Prolog^2 is a machine learning and logic programming framework,
that implements native second-order logic and SLD resolution.
https://github.com/JamesTrewern/prolog2

as 2nd order SLD, right? Prolog^2 seems to have liberated its
own thinking, since it has a native WAM implementation, while
Vanilla follows a less progressive, more conservative narrative,
where Resolution does some “magic”. But this “magic” is not
exploring the Hypothesis Space. I dunno, what does your
2nd order SLD explore then? The Prolog^2 GitHub readme
says the former paper I referenced was extended:

This is an extension of Meta-Interpretive Learning, which was
first used in Meta-interpretive learning of higher-order dyadic
datalog: predicate invention revisited
https://github.com/JamesTrewern/prolog2

I don’t know exactly what extensions were realized, and
whether non-top invention and disjunction have a better solution.
But you see the extra notation Prolog^2 invented for the logic
program clause invention variation points:

Defining second order clauses
We define these existentially quantified variables using curly braces like so: {X,Y}
P(X,Y):- Q(X,Y), {P,Q}. %Identity
P(X,Y):- Q(X,Z), R(Z,Y), {P,Q,R}. %Chain
P(X,Y):- Q(X,Z), P(Z,Y), {P,Q}. %Tail Recursion
https://github.com/JamesTrewern/prolog2

You could probably model them outside of the Prolog^2 WAM,
via backtrackable global variables in SWI-Prolog. This could
also be an interesting architecture and design for ILP, possibly
lifting some constraints. Although the top search also has
some advantage, since it can directly tap into the examples and
then directly invoke backward chaining.

There could indeed be an abyss. Maybe the Vanilla Top approach
exists side by side with another approach, since I read a nice
characterization, long awaited, of how the Hypothesis Space is refined:

The TPC pipeline has three stages: Generalise searches
each positive example in parallel to collect all constructible clauses,
Specialise removes any clause that entails a negative example,
and Reduce applies Plotkin’s reduction to eliminate redundant
clauses while preferring general patterns over specific ones.
https://github.com/JamesTrewern/prolog2

You can opt in to this mode as follows; it is credited to Patsantzis
and Muggleton (2021). But what is the other mode?

To enable TPC, add "top_prog": true to your config file
https://github.com/JamesTrewern/prolog2