Unexpected semantics of library(ordsets) for ordered _lists_ (not sets)

Better explain this with an example:

190 ?- _Ls = [a,b,c], _Rs = [a,a,a,b,b,c,c,c,c], ord_intersect(_Ls,_Rs,Is).
Is = [a, b, c].

191 ?- _Ls = [a,a,b,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, c].

192 ?- _Ls = [a,a,b,b,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, b, c].

193 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, b, c, c].

194 ?- _Ls = [a,a,b,b,c,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, b, c, c, c].

195 ?- _Ls = [a,a,b,b,c,c,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, b, c, c, c, c].

196 ?- _Ls = [a,a,b,b,b,c,c,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Is).
Is = [a, a, b, b, c, c, c, c].

In other words, ord_intersect/3 is happy to operate on a sorted list with duplicates (essentially, a multiset), despite the documentation saying that the entire library only operates on ordered sets:

Ordered sets are lists with unique elements sorted to the standard order of terms (see sort/2). Exploiting ordering, many of the set operations can be expressed in order N rather than N^2 when dealing with unordered sets that may contain duplicates.

From: SWI-Prolog -- library(ordsets): Ordered set manipulation

At least some of the other predicates in the library work the same way:

201 ?- _Ls = [a,a,a,b,b,c,c,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_subset(_Ls,_Rs).
true.

202 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_subtract(_Ls,_Rs,Ds).
Ds = [].

203 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_subtract(_Rs,_Ls,Ds).
Ds = [a, c, c, d, d].

% Ugh, this one may be slightly off!
204 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_union(_Ls,_Rs,Ds).
Ds = [a, a, a, b, b, c, c, c, c, d, d].

205 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,c,c,c,c,d,d], ord_symdiff(_Ls,_Rs,Ds).
Ds = [a, c, c, d, d].

As it happens I went looking at the implementation because I wanted an intersection predicate operating on lists with duplicates, and I thought that sorting the list would improve performance. I went to look at the implementation of ord_intersect/3 to see if I could cannibalize it for patterns to use in my predicate, and I thought to check, and realised it works on ordered lists with duplicates.

Now, I would like to keep this behaviour, but what is the right thing to do here in terms of the library? Should the documentation change to reflect the true behaviour of the code? Or should the code change to match the informal specification given in the documentation?

I’m for the former: keep the behaviour, change the documentation. After all, the current implementation works with ordered sets just fine, it’s just that it seems to work also with essentially multisets. I don’t think that’s a bad thing, it’s just that the library is less restricted than advertised. I don’t think having multiples changes the performance much either, but haven’t checked to be honest.

Sorry if I read this too fast. In the docs to is_ordset/1 I find:

True if Term is an ordered set. All predicates in this library expect ordered sets as input arguments. Failing to fullfil this assumption results in undefined behaviour. Typically, ordered sets are created by predicates from this library, sort/2 or setof/3.

(emphasis mine)

In other words, it is up to the programmer to ensure that the pre-condition holds.

Sorry, is the programmer the user, or the person implementing the library?

From a quick eyballing it looks like the behaviour is not so much undefined, as undocumented, and it seems to make sense for the most part, except from what I’ve seen so far for ordset_union/3.

Having this unexpected behaviour documented to help the user better understand how to use the library can’t be a bad thing, no?

Besides, if we wanted a new library that operates on ordered multisets we’d have to basically copy most of the code from the current library, and wouldn’t that be a bit daft?

Leaving a behaviour undefined allows more flexibility in the implementation – if the behaviour is documented, then it’s possible that some future optimisation would be impossible without breaking compatibility.

Would you expect ordset_union/3 to work if one of the lists were an unordered list? An “ordset” is defined as having no duplicates (that’s what sort/2 does, unlike msort/2). And the library is defined to only work with “ordsets”, that is sorted with no duplicates (and not multisets). If Prolog were a typed language, we might be able to make this obvious at compile time; as it is, you’ll just get a wrong answer (and we’ve had long discussions in the past about failure versus throwing an error, although not about ordsets specifically).

If in doubt, use list_to_ord_set/2.

Yes, I agree, we could argue that the code is the spec and so on. However, in this case, the docs explicitly say that for anything but lists as created by sort/2 and setof/3 the result is undefined behaviour.

I am starting to doubt a bit. What would be the intersection if you have duplicates? My gut feeling tells me the intersection shouldn’t have duplicates.

Or are you aiming to keep some number of duplicates in your intersection?

I agree with your comment in general terms, but library(ordsets) is an old library (apparently inherited from Quintus if I’m not mistaken?). I’m guessing it’s very unlikely that the implementation will change any time soon. Besides, it’s not like there’s any problem in the way it works with ordsets, so why would the implementation change? We could change it so that it doesn’t “work” (depending on how we define that) with ordered multisets, but I’m just unsure what would be the motivation for that.

It’s a library for ordsets, that works just as expected on ordsets and happens to (at least to some extent) work on ordered multisets also. What’s the right thing to do, is my question really.

EDIT: I guess it’s always possible there’s some bug in the library that was not found in the probably many years of its use, or some optimisation still to be made, that would change the behaviour on “ordered multisets” as I call them, so that’s two cases where the implementation could change.

Ah, that’s a good question. It didn’t occur to me to look around for a definition of set operations on multisets. It’s possible the results I see are wrong for multisets. I’ll have to check and come back to you on that.

In my use case I want to count the duplicates in the intersection of two sets. The sets are adjacency relations in an image and the intersection is the common adjacency relations. I want to know what are the adjacency relations most frequently shared between the two images, as part of a measure of similarity between two images.

It’s a very particular use case and I don’t know if it extends outside the strict use case I have.

I guess for my use case the best option is to just copy/paste the code in my project and keep a copy of it that I know works the way I want it “safe” from future changes, that might take away the undocumented behaviour (unlikely as I think those to be). I just don’t like the code duplication.

A bit more on that. The library’s documentation says that some set operations are cheaper on ordered lists:

Exploiting ordering, many of the set operations can be expressed in order N rather than N^2 when dealing with unordered sets that may contain duplicates. 

But note that this leaves the question of operating on ordered sets that may contain duplicates open. My intuition is that the same savings can be expected when a list is ordered but contains duplicates. When I started on my own implementation of “multiset interrsection” the first thing I did was to sort the two input lists retaining duplicates, because I thought that then I would only have to make one pass through both lists (walking over them in tandem probably).

So I think the natural way to implement both operations efficiently, intersection on ordered sets without duplicates and ordered sets with duplicates, is essentially the same, hence the observed undocumented behaviour that the predicate works on both kinds of list.

But, again I have to check what the intersection of two multisets is supposed to be, formally. Gonna have a look now.

I got curious and started looking at the C++ standard library. I found this:

Constructs a sorted range beginning at d_first consisting of elements that are found in both sorted ranges [first1, last1) and [first2, last2).

If [first1, last1) contains m elements that are equivalent to each other and [first2, last2) contains n elements that are equivalent to them, the first std::min(m, n) elements will be copied from [first1, last1) to the output range, preserving order.

Is that something that would work for you?

It seems to me at the moment that the “first min(m, n) elements” is what the current library(ordsets) implementation does not do. It always takes the number of elements from the first argument to ord_intersection/3?

PS: note that those algorithms use “equivalence”. If “a and b are equivalent” it means that neither a < b nor b < a.

Uhh… I didn’t get that to be honest :sweat_smile: Sorry, I think I’m thinking too much of ordered lists right now, whereas the C++ implementation works by copying one … array? Into another? I have to think about it a bit.

But! I found this helpful document from the University of Luxembourg that gives an intuitive definition of multiset operations:

The intersection of two multisets is the largest multiset which is included in both (simply
take the common elements), while the union is the smallest multiset which includes both
(for the union one has to take each element with sufficient multiplicity). Of course we
may similarly define sum, intersection, and union for several multisets (or just for one
multiset, the result being the multiset itself).

According to that (informal) definition, the way that library(ordsets) works on ordered lists with duplicates seems to be the expected behaviour if it was meant to operate on both ordered sets and ordered multisets.

You might need to squint at this a bit but from what I can tell even ord_union/3 that I thought was doing something iffy is behaving correctly for multisets.

No, I don’t think so. Look:

362 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,b,c,c,c,c,d,d], ord_intersect(_Ls,_Rs,Ds).
Ds = [a, a, b, b, c, c].

363 ?- _Ls = [a,a,b,b,c,c], _Rs = [a,a,a,b,b,b,c,c,c,c,d,d], ord_intersect(_Rs,_Ls,Ds).
Ds = [a, a, b, b, c, c].

Note that I changed the order of arguments in the second call. I think this works as expected.

I also have to read the ordsets library again. But about this specifically, those are ranges, not arrays. Ranges (and iterators) is what allows to have abstract algorithms that work on different containers. And the page I linked, again, is a spec, not the implementation!

Riight, because the formal definition of multiset operations is given in terms of their elements’ multiplicities (i.e. the number of times an element is repeated) at least as I see it on the wikipedia page, so that makes sense that the spec defines the same operations in terms of ranges, I guess.

EDIT (sorry, I’m thinking while writing):

The Prolog implementation doesn’t explicitly count anything, as usual. I wonder if the behaviour I observed is an effect of that. I mean, maybe (strong maybe) if set operations are implemented efficiently in the most obvious manner, by walking over ordered lists head-to-tail, then the correct behaviour on multisets naturally “falls off” this efficient implementation as a natural consequence. Multiset operations are a generalisation of set operations after all, so maybe this is a quirk of the Prolog implementation being sufficiently abstract by avoiding explicit element counts.

Since it says “unique” one can exploit this rule, in the
implementation of ord_union/3 for example:

union3(=, H1, T1, _H2, T2, Union) =>
    Union = [H1|Union0],
    ord_union(T1, T2, Union0).

Basically advancing the two arguments simultaniously.
Now for arguments that violate the unique element
requirement, the result is max-occurence-count:

/* SWI-Prolog 10.1.1 */
?- ord_union([a, c, d], [b, c, c, e], X).
X = [a, b, c, c, d, e].

?- ord_union([a, c, c, c, d], [b, c, c, e], X).
X = [a, b, c, c, c, d, e].

But multisets sometimes also want sum-occurence-count.

Then it wouldn’t be an ordered set but would be an ordered multi-set.

My suggestion: copy the code for ord_intersect/3, make a new predicate for yourself called ord_multi_intersect/3, and modify it to suit your needs. If it proves to be valuable, you could create a new library(ordmultisets).
Or follow Boris’ suggestion and convert the C++ code for set intersection and translate it to Prolog (you could probably modify the code to work with Prolog lists, but that would require learning how to write extensions and how Prolog lists are represented in memory).

I think that would be a copy of library(ordset) with a different name.

I don’t know what’s the motivation for that. library(ordset) as it is now is asymptotically efficient, even optimal probably. Why bother hacking lists?

Interestingly multi sets can be easily implemented deriving
from ordset. While ordset would take a list. A multi set could be
represented by a key value list. Where the value indicates

the count. The complexity of operations is usually O(N+M),
where N, M are the argument lengths. This is the same
complexity for union, difference etc.. across set and

multi set. If we allow negative counts, we would have something
like negative membership. Bonus keysort/2 can almost
normalize it. Not quite but almost:

?- keysort([b-3,c-5,a-1], X).
X = [a-1, b-3, c-5].

?- keysort([a-(-1),a-1], X).
X = [a- -1, a-1].

?- keysort([a-2,a-3], X).
X = [a-2, a-3].

For the last two one might desire [] and [a-5] as result.

The C code would probably be faster, but perhaps not much faster. And it would be a lot more work than modifying the existing ordset code for your needs. But if the algorithm does what you want and it’s not easy to modify the Prolog code to do what you want, it might be worthwhile. I was merely offering another possibility.

I was also assuming that you didn’t need to modify all of library(ordsets) but only the intersection code.

If one doesn’t use key value pairs, one can use msort/2 to normalize.
The library(ordsets) has already a correct intersection operation
that realizes the min-occurence-count multi-set style intersection.

It is based on a similar spill over as found for union:

/* SWI-Prolog 10.1.1 */
?- ord_intersect([a, c, d], [b, c, c, e], X).
X = [c].

?- ord_intersect([a, c, c, c, d], [b, c, c, e], X).
X = [c, c].

One can use clumped/2 to verify that min-occurence-count was realized:

?- clumped([a, c, c, c, d], X).
X = [a-1, c-3, d-1].

?- clumped([b, c, c, e], X).
X = [b-1, c-2, e-1].

?- clumped([c, c], X).
X = [c-2].  /* 2 = min(2, 3) */

I had to delete this. It had some half-correct (ie wrong) comments on the topic.

I hadn’t thought of that, I guess that’s a representation closer to the formal definition (as I understand it). As a bonus, that kind of representation is amenable to further processing with library(assoc) (SWI-Prolog -- library(assoc): Association lists).

My only concern would be that, often, multisets would be initially given as flat lists with duplicates, possibly unsorted, and they would then have to be put in the key-value pair (association list) representation. Which is certainly not the end of the world. It’s just one extra step to translate between the flat list and the key-value pair list.

Indeed, in my use case, I start with two flat lists with duplicates, I get their intersection, and put the intersection in the representation you propose, with each element associated to its count (i.e. its multiplicity) in the intersection.

Interestingly, this operation is a run-length encoding (RLE) of a list according to the good, old 99 Prolog Problems (Problem 10 in this site: P-99: Ninety-Nine Prolog Problems I don’t even know which one is the original anymore).

I’ve had the following code in my personal libraries since forever and it’s always handy for one thing or another. It may even be my own solution to the 99-Problems question from the time I was first learning Prolog. From what I can see, it does a single pass through the list, but if the list is unsorted it produces multiple pairs with the same key and different values (i.e. the same element with multiple counts). If the list is sorted first, it gives you an ordered list of elements and their counts. I have that as a separate predicate in my library. I give both below.

Please excuse the code, it is very old :slight_smile:

%!	frequencies(+Sample,-Occurences) is det.
%
%	Count the Occurences of each event in a Sample.
%
%	Occurrences is a list of key-value pairs, where keys are the
%	events in Sample and values their frequencies.
%
frequencies(List,Frequencies):-
	% Sort List for faster aggregation of element counts
	sort(0,@=<,List,Sorted)
	% Count contiguous element "runs".
	% Since the List is sorted that's each element's total count
	,run_length_encoding(Sorted, Frequencies).


%!	run_length_encoding(+List, -Run_length_encoding) is det.
%
%	Converts a list to its run-length encoded form where each "run"
%	of contiguous repeats of the same element is replaced by that
%	element and the length of the run.
%
%	Run_length_encoding is a key-value list, where each element is a
%	term:
%
%	Element:term-Repetitions:number.
%
%	Example query:
%	==
%       ?- run_length_encoding([a,a,a,b,b,b,b,b,b,c,c,c,a,a,f], RLE).
%	RLE = [a-3, b-6, c-3, a-2, f-1].
%	==
%
run_length_encoding([], []-0):-
	!. % Green cut.
run_length_encoding([Head|List], Run_length_encoded_list):-
	run_length_encoding(List, [Head-1], Reversed_list)
	,reverse(Reversed_list, Run_length_encoded_list).


%!	run_length_encoding(+List,+Initialiser,-Accumulator) is det.
%
%	Business end of run_length_encoding/3. Calculates the run-length
%	encoded form of a list and binds the result to the Accumulator.
%	Initialiser is a list [H-1] where H is the first element of the
%	input list.
%
run_length_encoding([], Fs, Fs).
% Run of N consecutive identical elements
run_length_encoding([C|Cs],[C-F|Fs], Acc):-
        % Backtracking would produce successive counts
	% of runs of C at different indices in the list.
	!
	,F_ is F + 1
	,run_length_encoding(Cs, [C-F_| Fs], Acc).
% End of a run of N consecutive identical elements.
run_length_encoding([C|Cs], Fs, Acc):-
	run_length_encoding(Cs,[C-1|Fs], Acc).

That should be a simple tweak of my code above, to apply an arithmetic function to the value of a key-value pair, rather than increment the value. Any function could do here. With a bit more work that would be a basis to implement all multiset operations, since they’re basically operations on elements’ multiplicities.

Oh, haha. Read this after I made this comment. I guess that’s the RLE of a list then.