Can ILP learn geometry rules with floating point numbers

I’m sorry to jump in and derail the discussion, but since @stassa.p doesn’t seem to mind, I would like to ask a question about ILP.
It seems ILP heavily favors programs that can only juggle atoms (you mentioned Datalog earlier), but what about numbers (integers or reals)?

For example, I would like to use ILP to do geometry.
Let’s say you want to rediscover the rules for parallelism.
You feed the ILP learner examples of pairs of lines, each defined by two points, where some pairs are parallel and others aren’t.
Importantly, coordinates are floats.
Would state-of-the-art ILP systems be able to induce the correct rules for parallelism?

Hi @kwon-young. First I want to apologise to @mathmaster13 for hijacking this thread. The moderators may want to split this whole conversation into a new topic, but that would be a lot of work, so my apologies to them too.

The tl;dr

Now, there is a lot to say here, so I’ll cut to the chase. You ask whether state-of-the-art ILP systems are able to induce the correct rules for parallelism. The answer is yes.

The way to do this is to add to the background knowledge (remember, in ILP we have two kinds of data: examples and background knowledge) suitable predicates to operate on floating point numbers. For example, you could add any of the arithmetic functions that go with is/2, or you can manually code your own arithmetic predicates, whatever best suits your needs.
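To make that concrete for your parallelism question, here is a minimal sketch of the kind of background predicate you might write. Note this is my own illustration, not code from any particular system: the representation of lines as line(point(X1,Y1),point(X2,Y2)) terms, and the hard-coded tolerance, are assumptions of mine.

%!      parallel(+Line1,+Line2) is semidet.
%
%       True when Line1 and Line2 have the same direction, up to a
%       small tolerance for floating point error.
%
%       A line is a term line(point(X1,Y1),point(X2,Y2)) where all
%       coordinates are floats. The cross product of the two
%       direction vectors is zero exactly when they are parallel,
%       which avoids dividing by zero for vertical lines.
%
parallel(line(point(X1,Y1),point(X2,Y2)),line(point(X3,Y3),point(X4,Y4))):-
        Dx1 is X2 - X1
        ,Dy1 is Y2 - Y1
        ,Dx2 is X4 - X3
        ,Dy2 is Y4 - Y3
        ,Cross is Dx1*Dy2 - Dy1*Dx2
        ,abs(Cross) < 0.000001.

For example:

?- parallel(line(point(0.0,0.0),point(1.0,1.0)),line(point(2.0,0.0),point(3.0,1.0))).
true.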

The longer example

To give you an example of how to do that from my own work, below is part of the background knowledge I used in a recent project.

The project’s goal was to create an AI agent to guide a robot boat to navigate on the surface of a body of water. I wanted a way to constrain the agent’s movements to only changes of direction that take it closer to its destination (a heuristic to avoid trying out unnecessary move actions). The agent is given a map of the area and its position on the map is represented as a term position(X,Y,Theta) where all three of X, Y and Theta are floating point numbers. (Theta was not needed after all so you’ll see it’s ignored in the code). Because coordinates are floating point numbers, predicates that operate on them must be able to manipulate floating point numbers.

Here’s how this works. First we have the predicate moving_forward/2 that checks that the agent’s next move reduces the (Euclidean) distance between the agent’s current position and its destination:

%!      moving_forward(+State1,+State2) is det.
%
%       Ensure the agent is moving towards its destination.
%
%       True when the distance of the agent's position in State1 from
%       its destination is less than the distance of the agent's
%       position State2 from its destination.
%
%       Used as move constraint. Enabled by setting option
%       move_constraint(moving_forward/2) in model configuration.
%
moving_forward(S1,S2):-
        inspect(position,position(X1,Y1,_),S1)
        ,inspect(position,position(X2,Y2,_),S2)
        ,inspect(destination,destination(Xd,Yd,_),S1)
        ,maplist(term_value,[X1,Y1,X2,Y2,Xd,Yd],[X1_v,Y1_v,X2_v,Y2_v,Xd_v,Yd_v])
        ,magnitude(Xd_v/Yd_v,X1_v/Y1_v,M1)
        ,magnitude(Xd_v/Yd_v,X2_v/Y2_v,M2)
        ,M2 < M1.

The distance between the two positions is calculated by magnitude/3 as the magnitude of the vector between the two sets of coordinates, as follows:

%!      magnitude(+C1,+C2,-Magnitude) is det.
%
%       Magnitude of a vector between two sets of coordinates.
%
%       Example:
%       ==
%       ?- actions:magnitude(0/0,3/4,M).
%       M = 5.0.
%       ==
%
magnitude(X1/Y1,X2/Y2,M):-
        Dx is X2 - X1
        ,Dy is Y2 - Y1
        ,M is sqrt(Dx^2 + Dy^2).

moving_forward/2 is one of several constraints that can be imposed on the agent’s moves. These are selected before learning, with a configuration option, move_constraint/1:

%!      move_constraint(?Predicate) is semidet.
%
%       Predicate indicator, Symbol/arity of a move constraint.
%
%       Used to choose the constraints that will apply to move actions.
%
%       Known constraints:
%
%       * within_action_budget/2: constrains the length of the list of
%       actions in the state plan term in the state vector to the value
%       of B in list_budget(action,B).
%
%       * in_scan_area/2: rejects moves that end with the boat outside
%       the limits defined in the scan area term in the state vector.
%
%       * passable/2: rejects moves that end in unpassable map terrain
%       as read off the current occlusion grid map.
%
%       * moving_forward/2: ensures the boat keeps moving towards the
%       coordinates of the destination term in the state vector.
%
move_constraint(within_action_budget/2).
move_constraint(in_scan_area/2).
move_constraint(passable/2).
move_constraint(moving_forward/2).

Each of those constraints is checked before a move is taken by the predicate move_constraints/2:

%!      move_constraints(+State1,+State2) is det.
%
%       Check move constraints in current State1 and next State2.
%
%       State1, State2 are state vectors. State1 is the state vector of
%       the state at the start of a move and State2 is the state vector
%       at the end of a move.
%
%       This predicate goes through all constraints defined in the model
%       configuration option move_constraint/1 and calls each of them.
%
%       Constraints must be defined as dyadic predicates. State1 and
%       State2 will be passed to each such predicate.
%
%       If any of those constraints fails, a move action transitioning
%       from State1 to State2 will not be taken.
%
move_constraints(S1,S2):-
        forall(model_configuration:move_constraint(C/2)
              ,(T =.. [C,S1,S2]
               ,call(T)
               )
              ).

move_constraints/2 is then used to define background predicates that represent the agent’s movement from one point on the map to another, in one of eight compass directions. You can see the calls to move_constraints/2 in the listing of the first-order background knowledge, below:

?- list_mil_problem(move/2).
Positive examples
-----------------
move([destination(x(70.0),y(80.0),theta(0.0)),plan(actions([]),count(0)),scan_area(x_min(0),x_max(249),y_min(46),y_max(136)),position(x(30.0),y(60.0),theta(0.0))],[destination(x(70.0),y(80.0),theta(0.0)),plan(A,B),scan_area(x_min(0),x_max(249),y_min(46),y_max(136)),position(x(70.0),y(80.0),theta(0.0))]).

Negative examples
-----------------

Background knowledge (First Order)
----------------------------------
move_north/2:
move_north(A,B):-modify(position,increment,y,A,C),move_constraints(A,C),inspect(position,position(D,E,F),C),register_action(C,move,move_north,[D,E],B),debug(actions,'Moved North ~w',[position(D,E,F)]).

move_north_east/2:
move_north_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,x,C,D),modify(position,increment,y,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_north_east,[F,G],B),debug(actions,'Moved North East ~w',[position(F,G,H)]).

move_east/2:
move_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,x,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_east,[E,F],B),debug(actions,'Moved East ~w',[position(E,F,G)]).

move_south_east/2:
move_south_east(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),modify(position,increment,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_south_east,[F,G],B),debug(actions,'Moved South East ~w',[position(F,G,H)]).

move_south/2:
move_south(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_south,[E,F],B),debug(actions,'Moved South ~w',[position(E,F,G)]).

move_south_west/2:
move_south_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,y,C,D),modify(position,decrement,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_south_west,[F,G],B),debug(actions,'Moved South West ~w',[position(F,G,H)]).

move_west/2:
move_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,decrement,x,C,D),move_constraints(A,D),inspect(position,position(E,F,G),D),register_action(D,move,move_west,[E,F],B),debug(actions,'Moved West ~w',[position(E,F,G)]).

move_north_west/2:
move_north_west(A,B):-modify(position,set,theta(0.0),A,C),modify(position,increment,y,C,D),modify(position,decrement,x,D,E),move_constraints(A,E),inspect(position,position(F,G,H),E),register_action(E,move,move_north_west,[F,G],B),debug(actions,'Moved North West ~w',[position(F,G,H)]).

Background knowledge(Second Order)
----------------------------------
(Chain) ∃.P,Q,R ∀.x,y,z: P(x,y)← Q(x,z),R(z,y)
(Identity) ∃.P,Q ∀.x,y: P(x,y)← Q(x,y)

Metasubstitution constraints
----------------------------
:- dynamic configuration:metarule_constraints/2.
:- multifile configuration:metarule_constraints/2.

configuration:metarule_constraints(m(identity, P, P), fail).
configuration:metarule_constraints(m(chain, P, P, _), fail).
configuration:metarule_constraints(m(chain, _, P, P), fail).

true.

The single, positive training example is the start and end state of a move from a starting position to a destination position. The learned program is a strategy to get from the starting position to the destination. Below I list one example of such a learned program. While it looks like the agent is moving all over the place, if you squint a bit you’ll see that its movement has a bias towards the north-east. That’s because the destination in the training example is to the north-east of the starting position:

?- time( learn(move/2,_Ps) ), print_clauses(_Ps), length(_Ps,N).
% 101,396,427 inferences, 9.250 CPU in 35.975 seconds (26% CPU, 10961776 Lips)
move(A,B):-move_east(A,B).
move(A,B):-move_north(A,B).
move(A,B):-move_north_east(A,B).
move(A,B):-move_north_west(A,B).
move(A,B):-move_south(A,B).
move(A,B):-move_south_east(A,B).
move(A,B):-move_east(A,C),move(C,B).
move(A,B):-move_east(A,C),move_north(C,B).
move(A,B):-move_east(A,C),move_north_east(C,B).
move(A,B):-move_east(A,C),move_south(C,B).
move(A,B):-move_east(A,C),move_south_east(C,B).
move(A,B):-move_north(A,C),move(C,B).
move(A,B):-move_north(A,C),move_east(C,B).
move(A,B):-move_north(A,C),move_north_east(C,B).
move(A,B):-move_north(A,C),move_west(C,B).
move(A,B):-move_north_east(A,C),move(C,B).
move(A,B):-move_north_east(A,C),move_east(C,B).
move(A,B):-move_north_east(A,C),move_north(C,B).
move(A,B):-move_north_east(A,C),move_north_west(C,B).
move(A,B):-move_north_east(A,C),move_south_east(C,B).
move(A,B):-move_north_west(A,C),move(C,B).
move(A,B):-move_north_west(A,C),move_north_east(C,B).
move(A,B):-move_south(A,C),move(C,B).
move(A,B):-move_south(A,C),move_east(C,B).
move(A,B):-move_south_east(A,C),move(C,B).
move(A,B):-move_south_east(A,C),move_east(C,B).
move(A,B):-move_south_east(A,C),move_north_east(C,B).
move(A,B):-move_west(A,C),move_north(C,B).
N = 28.

So the way this all comes together is that, during learning, the agent tries different moves and puts together a strategy to arrive at its goal, but it’s constrained to only try moves that reduce the distance to its goal. That’s a classic distance-based navigation heuristic.

I hope this gives you an idea of the kind of background knowledge you would have to define if you wanted to deal with floating point numbers, particularly in the context of geometry.

All of the output above is from Louise, the MIL system I used in that project, but the same approach would apply with any recent ILP system; e.g. I believe you could take the same approach with Popper.

Further discussion

Now, I would like to add something, because all this might look like too much work. And it is! Especially if all you’re trying to do is to re-derive a well-known result, like the rules of parallelism.

Unfortunately most of ILP presupposes a great deal of knowledge engineering that goes into constructing a suitable background knowledge base. In the worst case it looks like we are hand-coding the solution and giving the learning system exactly the background knowledge it needs to solve a problem, no more, and no less. That’s a fair criticism and a lot of work on ILP is subject to it.

The challenge then is to find ways to learn without having to manually define all this background knowledge. In my latest work, which I linked above, I show how to cast the ILP learning problem as a grammar learning problem, which basically means that we are trying to learn an automaton of some sort.

So, nowadays, if I was trying to learn the rules for parallelism, I would not think of geometry and floating point arithmetic, as such, but instead I would try to answer the question:

What kind of automaton accepts or generates only parallel lines?

Then I would try to come up with examples of the output of such an automaton and give them to my system to re-construct the automaton from the examples. In that case, the First-Order background knowledge would only need the pre-terminals in the target language; essentially those would be the states of the target automaton. As for the Second-Order background knowledge, it now becomes a Second-Order Definite Normal Form, which is exactly what it sounds like: a normal form expressed as a set of second-order definite (datalog) clauses. So for instance we could use a second-order normal form for context-free grammars, like a second-order version of Chomsky Normal Form; or a second-order normal form for context-sensitive grammars, like a second-order version of Kuroda Normal Form.
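To sketch what I mean, in the metarule notation used earlier (this is my own illustration, not a quote from any paper), a second-order rendering of Chomsky Normal Form could consist of just two kinds of clause:

(Chain) ∃.P,Q,R ∀.x,y,z: P(x,y)← Q(x,z),R(z,y)
(Identity) ∃.P,Q ∀.x,y: P(x,y)← Q(x,y)

Here Chain plays the role of CNF productions of the form A → BC, and Identity the role of productions A → a, with Q ranging over the pre-terminals given as First-Order background knowledge.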

This is a more advanced method that is still under construction, so to speak, but I am much more excited about it than the traditional, manual way of hand-crafting background knowledge that I show above. However, if this is your first time working with ILP, I think you’d find the traditional way easier to reason about.

@stassa.p Thank you very much for your time and answer to these beginner questions :slight_smile:

It is kind of amazing that you mention grammars, since my own interest and the motivation for my first question is precisely this: graphical grammars.
To be more precise, here is a blog post I wrote that defines what I mean by this: Triangle · Kwon-Young Choi

Basically, I want to write (or induce automatically) pure relational graphical grammars that can relate high-level semantic structures (let’s say a triangle) and a list of graphical primitives (a list of segments).

So, here is my poor attempt at reformulating this problem as you do:

An ILP setup has positive/negative examples + background knowledge:

  • Positive examples would be sets of 3 segments forming a true triangle
  • Negative examples would be any set of any number of segments that do not form a triangle
  • Background knowledge would be spatial equality of 2 points
    • in a purely symbolic geometry setting, we could just reuse Prolog unification: point(A, B) = point(C, D)
    • in a setting with real coordinates, we could use Euclidean distance with a threshold (it would be amazing if the threshold could be learnt!)

Then your ILP system (Louise or Popper) could learn the chain of equalities that would produce a triangle?
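For what it’s worth, the second variant of that background knowledge could be as simple as the following sketch. The predicate name and the hard-coded threshold are placeholders of my own (and the threshold could in principle be made an extra argument, with candidate values supplied for learning):

%!      points_equal(+Point1,+Point2) is semidet.
%
%       True when Point1 and Point2 coincide up to a fixed tolerance.
%
points_equal(point(X1,Y1),point(X2,Y2)):-
        D is sqrt((X2 - X1)^2 + (Y2 - Y1)^2)
        ,D < 0.001.

A triangle definition could then chain three such equalities over the endpoints of three segments.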

For more context on what I am interested in:

Basically, my interest stems from trying to write graphical grammars for Optical Music Recognition (OMR).
This is the field of trying to extract musical information from images of music scores.

In contrast to OCR (optical character recognition), which is now dominated by deep learning models, OMR has a very strong and complicated graphical grammar.
I believe (this is my own opinion) that formalizing these graphical grammar rules in code instead of learning them purely through deep/machine learning can be immensely more efficient and accurate.
However, I can assure you that writing these rules by hand is an absolute nightmare (and I love it ^^).
That’s why I am searching for alternatives, which ILP is.

Hi @kwon-young, OMR sounds interesting and yes, maybe that’s a field where ILP could make some progress beyond what’s possible with deep learning. Obviously I’d be really happy if that were the case. Maybe we should take that conversation private to avoid polluting the thread; on the other hand, I’d like to hear what the rest of the community has to say.

@EricGT what do you think? I would like to dig into this problem that @kwon-young is interested in, of learning OMR rules with ILP, specifically with systems that I’ve created with SWI-Prolog. Is this the right forum for this kind of discussion, and is it something we should share with the rest of the community as a separate thread? Or should we take the discussion to private messages?

@kwon-young, in the meantime, in my paper linked above I show examples of learning L-Systems with the new, self-supervised MIL system Poker (careful, that’s not Popper! I know, it’s confusing). L-Systems are a grammar formalism used to describe shapes with branching and self-similarity, like fractals and plants. You can find more information on Wikipedia: L-system - Wikipedia.

L-system grammars are used as generators, to generate strings of movement and drawing instructions for a “robot”. This used to be a physical robot drawing on a physical canvas, but now it’s usually a simulated robot drawing on a virtual canvas. Complex shapes can be drawn this way; for example, below are the rules for the Hilbert Curve fractal, as learned by Poker for one of the experiments in my paper:

s(A,B,C):-y(B,D),minus(A,E),x(E,F),f(F,G),plus(G,H),y(H,I),f(I,J),y(J,K),plus(K,L),f(L,M),x(M,N),minus(N,O),s(O,D,C).
s(A,B,C):-x(B,D),plus(A,E),y(E,F),f(F,G),minus(G,H),x(H,I),f(I,J),x(J,K),minus(K,L),f(L,M),y(M,N),plus(N,O),s(O,D,C).
s(A,B,C):-plus(B,D),plus(A,E),s(E,D,C).
s(A,B,C):-minus(B,D),minus(A,E),s(E,D,C).
s(A,B,C):-f(B,D),f(A,E),s(E,D,C).
s(A,B,B):-empty(A,B).

It’s not immediately easy to see, but those are grammar rules. They translate to the following DCG, which was used to generate training and testing examples for the experiments in the paper:

hilbert_curve([+,y,f,-,x,f,x,-,f,y,+|Ss]) --> x, hilbert_curve(Ss).
hilbert_curve([-,x,f,+,y,f,y,+,f,x,-|Ss]) --> y, hilbert_curve(Ss).
hilbert_curve([+|Ss])--> plus, hilbert_curve(Ss).
hilbert_curve([-|Ss])--> minus, hilbert_curve(Ss).
hilbert_curve([f|Ss])--> f, hilbert_curve(Ss).
hilbert_curve([]) --> [].
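To see the correspondence with the L-system, you can query that DCG with a single pre-terminal and watch it perform one rewriting step, here the Hilbert Curve rule X → +YF−XFX−FY+:

?- phrase(hilbert_curve(Ss),[x]).
Ss = [+,y,f,-,x,f,x,-,f,y,+].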

And these are the definitions of the symbols plus, minus, f, x and y, that are also given to the learning system as background knowledge:

plus --> [+].
minus --> [-].
x --> [x].
y --> [y].
f --> [f].

So those are just the pre-terminals of the grammar, like I say above. In the example you give of defining a triangle DCG in your blog post, I think those pre-terminals, and therefore the background knowledge, could correspond to triangle vertices. Then the system would have to learn the concept of a segment from the ground up. Alternatively, you could give the system seg/2 terms, but I find that less interesting because if you do that you can also directly define triangles and whatever shape you want. And where’s the fun in that? :slight_smile:

There are more serious reasons than fun also, but let’s first decide how we can continue this conversation. I’ll wait to hear from @EricGT . I’ve polluted this thread too much already!

What does Louise do to refine its Hypothesis Space?

Genetic Algorithms (GAs) in robotics are optimization techniques
inspired by natural evolution, used to solve complex problems like
path planning, trajectory optimization, and robotic design.

The analogy is striking:

  • Genetic Drift that performs the Adaptation:
    GA: Multi-membered Binary Crossover of Genetic Code
    ILP: Punctual Refinement based on Conflicting Facts?

  • Genotypes that perform the Adaptation:
    GA: Some Genetic Code that leads to a Phenotype
    ILP: Intensional Rules that lead to Extensional Facts

  • Environment to Adapt to:
    GA: Sometimes the Physical World, virtually simulated (sic!)
    ILP: Background Knowledge

  • Selection Function to Adapt to:
    GA: Some Fitness Function
    ILP: Positive and Negative Examples

But I haven’t found a paper yet where ILP does some crossover. Maybe a SWI-Prolog prototype? Håkan Kjellerstrand’s latest opus ran a GA in Picat for hundreds of hours, to discover some physical constant formulas, using his symbolic regression program, a method which had a revival in 2022.

I am not sure I understand the question. Meta-Interpretive Learning (MIL), which Louise uses, is not an optimisation-based algorithm. As I said before, it is a second-order form of SLD-Resolution, so it really helps to think of MIL as you think of Prolog proofs, except that some of the variables bound during the proof are bound to predicate symbols; or “functors” in Prolog parlance.

So what does Prolog do to refine its hypothesis space? Nothing, because it does not have anything that can be called a “hypothesis space”. What it has is a “proof space”, which is in practice a structure known as an SLD-tree.

An SLD-tree is a tree where the root node is the goal that begins the proof, each intermediate node is a new goal derived during the proof by resolution of a program clause with a goal earlier in the tree, and each branch is a program clause from our logic program.

MIL is exactly the same, except a) program clauses also include second-order definite datalog clauses, with second-order variables existentially quantified over the set of predicate symbols; and b) the logic program is not static but includes new clauses derived during the proof, i.e. the “hypothesis” we are learning. In practice the second-order clauses are reduced to first-order clauses with ordinary universally quantified first-order variables and even the clauses of the hypothesis can be left out of the proof. So again you only need to think about every-day first-order SLD-Resolution.
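As a very rough illustration of that idea, here is a toy meta-interpreter of my own devising (emphatically not Louise’s actual implementation, and with no control of recursion or of duplicate metasubstitutions). Literals are encoded as m(P,X,Y) terms so that predicate symbols can be bound like ordinary variables during the proof, and each metarule instantiated along the way is collected as a clause of the learned hypothesis:

% Second-order clauses: the Chain and Identity metarules.
metarule(chain,    m(P,X,Y), (m(Q,X,Z),m(R,Z,Y))).
metarule(identity, m(P,X,Y), m(Q,X,Y)).

%!      prove(+Goal,+Acc,-Subs) is nondet.
%
%       Prove Goal by SLD-Resolution, accumulating instantiated
%       metarules, i.e. clauses of the learned hypothesis, in Subs.
%
prove(true,Subs,Subs).
prove((L,Ls),Acc,Subs):-
        prove(L,Acc,Acc1)
        ,prove(Ls,Acc1,Subs).
prove(m(P,X,Y),Subs,Subs):-     % Resolve with background knowledge,
        bk(P)                   % where bk/1 lists the BK predicates.
        ,T =.. [P,X,Y]
        ,call(T).
prove(m(P,X,Y),Acc,Subs):-      % Resolve with a metarule, recording
        metarule(Id,m(P,X,Y),B) % the metasubstitution; shared variables
        ,prove(B,[Id-(m(P,X,Y):-B)|Acc],Subs). % are bound as B is proved.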

Maybe you can re-phrase your question as “What does Prolog do to refine an SLD-Tree?”.

And the answer is: unification. Ultimately, that’s the “refinement” in SLD-Resolution (and first-order Resolution in general).

Now, if you want, you can look at unification as a reduction of uncertainty, in the sense that when a variable is unified with another term our uncertainty about its true value is reduced. In that sense, Resolution with unification is an optimisation procedure, specifically one that minimises entropy (as a measure of uncertainty). I like to think of it that way, but only in the most general terms, i.e. don’t expect me to write down the formula to calculate the entropy of a logic program or its reduction during unification, or anything like that. It’s just an intuition.

Genetic Algorithms are cool but they suffer from the same problems with local optima as gradient-based optimisation (and IIRC gradient optimisation is often used to train GAs). We don’t have that kind of problem with Resolution.

Ok, I have found your Poker implementation in the vanilla repo.
It doesn’t really have any kind of installation instructions, but I suppose it only relies on SWI-Prolog?
Is there a specific version needed?

If that is the case, I’ll try to write an example to learn my triangle DCG rule, and probably ask questions back when I get blocked :slight_smile:

The thing is, I know that geometry problems like this have already been extensively studied in the literature, but more from the angle of theorem proving, I believe.
But I have never really done an extensive literature review of the subject, so I don’t really know what I am talking about.
Is there a link between theorem proving and ILP?

Oh god, the documentation for Vanilla and Poker is woefully lacking. I’m sorry about that.

No, you don’t need anything but the latest version of SWI-Prolog. I recommend the latest development version, which is more likely to be closer to the one I used for the development of Poker.

You will definitely need some help from me to use Poker. Many apologies, it is still new and I’m still working on it. In the meantime you can look at the documentation of the earlier version of Louise (GitHub - stassa/louise: Polynomial-time Meta-Interpretive Learning), which shares much functionality with Vanilla and Poker, particularly in the way it uses “experiment files” and configuration options, and also the logging facilities. I really need to just copy some of that stuff over from that repo to Vanilla.

Edit: I just had a look at the Louise documentation to remind myself of what’s in there, and there is a section on Vanilla, including how to control recursion. This is mostly the same for the stand-alone version of Vanilla, which was really split off Louise at some point. So that will help also. I’ll see about updating the documentation in the Vanilla repo in the next few days if you are going to be working with it, but please just get in touch if you need anything.

For Meta-Interpretive Learning, certainly! Because it’s really a second-order form of SLD-Resolution, which is a proof procedure. But more broadly, Inductive Logic Programming is all about going from deductive inference to inductive inference, i.e. from reasoning to learning. And what I’ve learned is that there is no difference between the two: induction is deduction raised to a higher order of logic; reasoning and learning are one. I’ve even published that, and the reviewers didn’t throw it out as just a poetic metaphor, so I think it’s good.

Edit: most other ILP approaches also somehow use proof to learn, btw. For example, Popper casts the learning problem as a satisfiability problem and hands it over to a SAT solver. Inverse Resolution is what it says on the tin, an inversion of Resolution theorem-proving. And so on. That’s the main idea behind ILP: to use automated theorem proving for learning.

Gradient-based optimisation usually refers to differentiable problems. GAs need not have a differentiable problem: just as genetic recombination in nature is seen as largely random, GAs use similar recombination operators. It’s more of a random swarm walk when the GA is multi-membered; sometimes an utterly useless walk down spirals that doesn’t even come close to a local optimum, producing species that do not survive the next summer. On the other hand, does Louise find a global optimum? I was reading the TOIL paper. It seems that your hypothesis space refinement is a kind of enumeration of the hypothesis space:

https://arxiv.org/pdf/2106.07464

It just says “Select M ∈ MM”, and returns an M instance if SLD-resolution derives the empty clause from the negation. So you use resolution with refutation as the formalisation, i.e. the reductio ad absurdum:

G, ~A |- f

Classical logic double negation elimination gives you:

G |- A

So if you run it and show the first M instance found, then the order in which you do your “Select M ∈ MM”, and the order in which you determine instances, is your ranking. If the enumeration defines your ranking from highest to lowest, then your algorithm finds a global maximum, by a kind of vacuous argument. But what if the end user wants a more complex preference structure? Like supervised learning (SL), but with a cost function: you would not only learn to imitate a ground truth, with the big amount of generalisation that rules give you, but instead you would have reinforcement learning (RL) that minimises some cost. RL is probably more common for robotics than SL alone. But what do you do with the generalisation that rules have? Do you use a small set for initial learning, and then validate against a larger set?