Order and sorting

You mean a kind of predmerge/4. Using the sort/4 convention
of specifiying the key argument and the comparison function. You
will create the overhead that you have to pass two Prolog list to C
and then get back a C structure and

create a Prolog list again. Performance wise the complexity of
merge, using any Richard O’Keefe style algorithm, is usually
O(N_1+N_2) where N_1 is the length of the left argument and
N_2 is the length of the right argument.

Richard O’Keefe computes this complexity in his 1990 book.
If N_1 is much bigger than N_2, i.e. N_1 >> N_2, one could be
advised to consider using something else than merge/3. For
example if N_1 is already in a hash table,

adding N_2 many elements only takes O(N_2) time, since a
hash table has an amortisized O(1) insert. This is quite a win
O(N_2) << O(N_2+N_1) when N_1 >> N_2 holds, but needs
more space for the hash table. But hash tables need a

meaningfull and efficient hash on the key argument.

Thanks for the citation, didn’t know that such a thingy exists.

Just a remark on Craft of Prolog by Richard O’Keefe.
If somebody wants an intro into fixpoint calculations
for for example Datalog. You find that also in the 1990

book, just the next sections after ord_union/3. The above
amateur introduction is a little heavy compared
to what Richard O’Keefe presents in his book.

Another likewise introduction could be this XSB tutorial:

Trie-Terms for Sets and Prolog Databases
David S. Warren - ?date?
https://github.com/GunterMueller/XSB_Prolog/blob/e72d74b627a9e4014937abca7941589b27047319/Sources/xsb-code/XSB/lib/prolog_db_doc/prolog_db.pdf

Its hidden inside the source forge distribution of XSB
Prolog, a file by the name prolog_db.pdf. Don’t know
exactly how to cite it.

“The Craft of Prolog” contains code which is very close to the predicates I have now defined in that msorted.pl file I shared above, see for example “6.8 Prefixes” on page 232. There is plenty more. None of it is groundbreaking, I have now shown the same algorithms in C++'s STL. Some of that code has found its way in the SWI-Prolog standard library, and some has not. What could be the reason?

Here are some guesses (I don’t even expect @jan to have a definite answer):

  • No one ever needs any of this in real life;
  • The predicate names are too generic and there are too many ways to achieve the same result;
  • The definitions are in fact trivial;
  • Depending on the purpose of the client code and the concrete problem, subtle differences in the implementation lead to big differences in efficiency,

and so on.

I am at the moment trying to understand and get opinions: is there any use of continuing down the path that I have outlined in this post above, or is this all useless? I tried to guess what makes it useless but I hoped to hear from others.

OK, this is not like that. The merge is quite deeply embedded into the sorting and cannot be used without fancy refactoring with questionable benefit.

I’d be interested certainly, especially in the context of not having to keep a separate copy of library(ordsets) in case its implementation changes and it no longer works on multisets/msorted lists.

But there’s still a nagging feeling of code duplication that bothers me, like the princess and the pea. We can have either two separate libraries, library(ordsets) possibly with new type checking to ensure inputs are ordered sets (sorted with sort/2) and a new library only for multisets again with typechecking to ensure it only operates on multisets (sorted with msort/2).

Or we can have one library, either yours or library(ordsets) that operates on both kinds of sets, possibly with some more complex type-checking to separate the two distinct special cases.

I certainly would prefer to have one, unified library. But then we have duplication with respect to library(ordsets) [Edit: I’m saying that because @jan has indicated he prefers to leave library(ordsets) and its documentation the way they are now, instead of changing the documentation -and, with it, the specification of the implementation- to also cover multisets].

I don’t know what’s best. Maybe the best thing is for you to submit your library as a pack and wait to see if it’s picked up by enough users to warrant an addition to the SWI standard library. Maybe not, because I’m guessing many users don’t use packs often.

It would be so much easier to answer questions like that if Prolog was a more popular language and we had a few thousand users to sample use cases from. :confused:

Here is a short summary of the situation from my point of view:

  • Set operations on multisets as produced by msort/2 are fine but marginally useful; we also now know how to get those with very limited effort
  • There are other algorithms on “sorted” lists (in addition to set operations) that are also useful but they would be far more useful if they supported equivalence
  • By now I am somewhat convinced that the sort/4 interface is superior to the predsort/3 interface and should be used for any algorithms on lists sorted by sort/4 in (SWI-)Prolog

This last point means that defining a comparison predicate is fine, but providing a key and order like sort/4 sounds much finer. The only use cases which are not covered is partially sorting atomic terms (like, sorting on a string suffix?)

To be honest, I think I have scratched my itch for the time being :smiley:

In my view, multisets and normal sets are different things. For one, representing a multiset using duplicate entries may work for cases where the number of duplicates is low, but gets problematic if it is high. So, using a pair representation make (IMO) more sense. I haven’t followed this very closely, but I have the impression that multisets there are options for different semantics for the various popular set operations. As I see it

  • If we want to support multisets, make a proper new library based on existing practice in the Prolog world and/or other programming languages.
  • If ordsets is logically good enough for your project, use it and make a comment about the assumptions you did. If this is mission critical, back this up with tests and/or assertion/1.

I don’t want to argue too much over this because it’s not really that important and you have more serious things to do (cough cough the distorted character highlighting in the editor cough cough) but one thing that came up in the conversations we’ve had on this the last week or so is that if we choose a representation for multisets, that representation will necessarily also work for sets with unique elements because those are just a special case of multisets. For the same reason, if we then make a new library that works on multisets, it will also work on sets with unique elements and then we’ll have two libraries that work on single-element sets: library(ordsets) and the new library. If the two libraries have different representations for sets with single elements, that will only create confusion about what a “set” is in Prolog. Not to mention we already have a different representation of single-element sets with library(nb_sets) as well as parallel functionality in library(assoc) (can be used to represent multisets with elements as keys and values as multiplicities), library(lists) (e.g. is_set/1 and list_to_set/2, but also clumped/2 can be used to collect multiplicities of elements in a mult-element “set”), and possibly others I’m not thinking of right now.

I think none of this is a good idea. Prolog is not a typed language and what we’re discussing here is, essentially, how to define a type set. I don’t think we should. I think the current, somewhat anarchic, situation is the ideal. The user is free to choose the representation that is best for her project, and also add her own as needed. And even if we were to decree that a “set” in Prolog is this or that representation, different user’s needs would still organically result in different users choosing different representations for different use cases.

So the best solution is to do nothing. And isn’t that great? No need for extra work!

Well, about that. assertion/1 will raise errors if a future version’s implementation changes, so I think my best option is to keep a copy of the current version of library(ordsets) in my project so I can use it as I want (to operate on “multisets”). But then, if I want to distribute my project at some point, what should I do? Won’t my local copy clash with the one in the SWI-Prolog library, causing errors and confusion to the user?

What is the good practice here? I can’t just copy/paste the code from library(ordsets) and rename it. That would be plagiarism, essentially. Or is there a way to do that which is covered by the license?

EDIT: just to be clear. Of course I can always implement my own version of multisets as lists sorted with msort/2 as @Boris did in this thread. But I am trying to be the as lazy as possible, in the best software engineering tradition of not duplicating work and reinventing the wheel. The code I need is alrady written, there’s no reason for me to re-write it again. I hope that makes sense?

Hi Jan, we are now at 48 (!) posts here, I understand this is too much.

Yes, fully agree. I wanted to provide such a library, based on existing practice. While researching, I learned something and I wrote some code but I only ended up with more questions :smiley:

Existing practice is to provide algorithms that create or work on sorted lists [1]. The set operations “union”, “intersection”, and so on are part of it. Other examples would be “partition point”, “is sorted until”, or “merge” [2].

I am attaching two files which contain a few such algoritms and a test suite for them, including tests for some of the library(ordsets) predicates:

sorted.pl (6.5 KB)
test_sorted.pl (6.2 KB)

At the moment those only work with compare/3 on the full terms, so equality.

All of this is quite useless because it is using equality and the “set” algorithms are de facto identical to library(ordsets). These algorithms become useful when we can use equivalence instead [3]. Initially, I planned to provide equivalence using a predsort/3 -like interface, where the client code provides a comparison predicate. I have not done it because it is trivial; and because I am not sure that this is the way forward.

Current status: the code in sorted.pl is marginally useful, as a starting point for a discussion for a possible library. The tests were useful to me while writing the code. This could be extended in two orthogonal directions:

  • A presumably more efficient version could be implemented in C as a foreign library, I think I’d know how to do that;
  • Versions of the algorithms that take a comparison predicate with the same signature as compare/3 (the predsort/3 interface) can be trivially added to it.

Now I have answered my original question from the first post in this topic (yay!). This only opened a new question:

Is this useful in Prolog at all? I don’t see an obvious way of providing an efficient implementation of these algorithms for equivalence, with a client-code provided comparison predicate (it will require calling Prolog from C on every comparison, right?). I suspect it is possible to achieve a useful efficient implementation if we used the sort/4 interface instead, where the client code provides a key and an order. I wanted to share this preliminary work in this post, hoping to get some comments. Implementing algorithms on sorted lists that use the sort/4 interface (Key + Order instead of Comparison) is just a vague idea at the moment and it will require more work. I will do eventually :smiley:


[1] I purposefully used the specifications and reference implementation from the C++ algorithms library as a starting point. Section 6, “Sequences” pp 187-259 of O’Keefe’s “The Craft of Prolog” covers the same topics in a very similar manner. For example, 6.8.1 “Prefixes of lists” on page 232 shows another take on partition_point/5 in the attached sorted.pl. Section 5, “Data Structure Design” pp 153-186 contains the “set as a sorted list” examples, which seem to have strongly influenced library(oset) and library(ordsets). See this post above.

There is a powerful idea: keeping the algorithms separate from the data structure. Alex Stepanov, the original author of the notorious “Standard Template Library”, has tried to demonstrate this; his approach is to define the algorithms only through iterators on the data structure. See “Elements of Programming” by Stepanov and McJones and “From Mathematics to Generic Programming” by Stepanov and Rose. O’Keefe also spends some time showing how to convert from one concrete sequence implementation (list, backtracking…) to another.

[2] In my proposal I have only considered algorithms that require a “forward iterator”, so that they can be efficiently implemented on Prolog lists.

[3] Sorted in languages from the C tradition usually means “sorted on a binary relation”, and the default is “less than”. There is no reason to prefer the binary relation over compare/3. If needed, one could define compare_rel/3 in terms of a relation rel/2 like this:

compare_rel(Order, A, B) :-
    (   rel(A, B)
    ->  Order = (<)
    ;   rel(B, A)
    ->  Order = (>)
    ;   Order = (=)
    ).

A binary relation is a total ordering if it is transitive and obeys the trichotomy law, where for every pair of elements, only one holds: the relation, its converse, or equality. Prolog’s standard order of terms with compare/3 is an example of a total ordering. If we replace equality with equivalence, we have the weak-trichotomy law. keysort/2 in Prolog is an example of equivalence, where two terms are equivalent if the keys are equal. sort/4 already supports equivalence with a more expressive interface than keysort/2.


A small observation: nat_sort() in pl-list.c seems to sort using a “sorted until” and “merge”, but those are deeply embedded in the implementation and I didn’t see an obvious way to reuse them for my purposes, other than as inspiration.

It is a bit long post to answer in detail now. I think we first need to decide on a proper representation of multisets, as

  1. A list of Element-Count, ordered by Element where each Element is unique
  2. A list of Elements ordered by Element with duplicates.

For non-standard ordering, the classical solution in Prolog is to use keysort/2 after computing a sensible sort key. ECLiPSe sort/4 generalises sort/2 and keysort/2 in a nice way. This implies we could consider representing multisets a list of e(Key,Count[, Payload]), where the algorithm only acts on Key and Count and an arbitrary payload can be added.

Yes, it was a long post. I left out a lot of detail :rofl:

It also seems that I have so far failed to demonstrate the importance of keeping data structures and algorithms orthogonal to each other. I am also under the impression now that multisets will be covered by whatever it is I tried to propose.

EDIT: specifically, a multiset that just counts is very marginally useful. A useful multiset has equivalent (but not equal!) elements in it – same key but different payload. In other words, you cannot compress the representation to a key and count; in the best case you save the overhead of repeated keys, which in Prolog should be the same term anyway. You can create that kind of multisets with keysort/2 and sort/4, but you cannot correctly work with them unless you relax the trichotomy law (compare/3).

Just to be clear since this got lost in the detail, this is not a proposal for a multiset library. It is a proposal for an efficient and generally useful library on sorted lists.

It happens so that if you chose to represent a set as a sorted list, you can use the library to perform set operations on those lists.

Sorry for the long-winded explanation.

I’m not so convinced. E.g., predsort/3 seemed a good idea when I implemented it. Typically I find myself creating a compound term that has a witness and payload(s) and use sort/4. I’m not sure this is merely for performance or that it is also conceptually cleaner/more Prolog minded. Surely sort/4 is way faster than a Prolog based sort based on a user defined comparison predicate.

Yes, I agree fully. I don’t want to be using predsort/3 at all. I don’t want to provide (in my client code) a compare/3 -like predicate. I only mention it as it demonstrates the interface I do not like. I would prefer something like the sort/4 interface that expects a key specification and an order specification (one of @<, @=<, @>=, and @>). It is clean and there is almost nothing useful you cannot do with it.

I know that I went overboard with my long-winded proposal but I tried and failed to make it any shorter. I also have not yet have a working (ie implemented in code) proposal for the Key + Order interface. I will eventually get to it. I was asking for guidance and opinions.

Yes! To summarize the summary: there are well understood algorithms that work on sorted sequences. I know how to implement those in pure Prolog using a 3-arg “compare” but I don’t like that. I believe at the moment there is a cleaner interface with sort/4 -like Key + Order. I have collected some algorithms in the sorted.pl file and added tests in test_sorted.pl. I chose these algorithms because they are well established and can be efficiently implemented on Prolog lists. Those two files are just a scaffold which would allow me to continue.

I don’t have any concrete questions at the moment but please do comment if you have the time.

An optimization is missing, currently in SWI:

ord_symdiff(<, H1, Set1, H2, T2, [H1|Difference]) :-
    ord_symdiff(Set1, H2, T2, Difference).
ord_symdiff(=, _, T1, _, T2, Difference) :-
    ord_symdiff(T1, T2, Difference).
ord_symdiff(>, H1, T1, H2, Set2, [H2|Difference]) :-
    ord_symdiff(Set2, H1, T1, Difference).

But the relabeled and arguments permuted variant has:

sorted_symdiff_cmp(<, X, Y, A, B, [X|SD]) :-
    sorted_symdiff(A, [Y|B], SD).
sorted_symdiff_cmp(=, _X, _Y, A, B, SD) :-
    sorted_symdiff(A, B, SD).
sorted_symdiff_cmp(>, X, Y, A, B, [Y|SD]) :-
    sorted_symdiff([X|A], B, SD).

Can we not also short-cut and switch sides like in union?

But strangely SWI ord_symdiff/3 is not implemented with single
sided unification, while ord_union/3 does use (=>)/2.

I would prefer to not optimize any of this at the moment. This was really meant as a scaffold, a temporary implementation that allows me to write tests that would allow me to write something more useful. I would have withheld those if I would have known how much effort this would create for you and others.

PS: I tried to use an (expensive!) LLM to help with mechanical code re-writing between C and C++ and Prolog, and back. This failed miserably (probably a skill issue on my end). So for now I am keeping language-specific optimizations out of this, since it makes my own mechanical translation even easier.

But the observation that the count is abs(n-m), is really nice:

?- ord_symdiff([1,2,2,2,2,3], [0,2,2,4], X).
X = [0, 1, 2, 2, 3, 4].

?- ord_symdiff([0,2,4], [1,2,2,2,2,2,3], X).
X = [0, 1, 2, 2, 2, 2, 3, 4].

I copied the specifications from the cppreference website where I also found the reference implementation. I have provided links to everything somewhere up there in this monster thread

I must admit I’m a bit lost … It occurs to me that the sort/4 spec using an argument number of 0 for the entire term seems useful. Not sure about the ordering argument (>, >@, …) for what you are trying to do.

Good point. Let’s get back to this once I have a working demonstration.