(C/C++) Blobs and the flags for "unique" and "nocopy"

A foreign blob has PL_BLOB_UNIQUE and PL_BLOB_NOCOPY flags. (There are also PL_BLOB_TEXT and PL_BLOB_WCHAR, but as far as I can tell, user-defined blobs shouldn’t use them.)

The PL_BLOB_UNIQUE flag is used in testing for equality:

Each time an atom or a PL_BLOB_UNIQUE blob is created, it must be looked up in the atom table;
if a blob without PL_BLOB_UNIQUE is created, no lookup is done.

However, looking at the implementation of PL_unify_blob(), which calls lookupBlob(), the “no lookup” is only done if both PL_BLOB_UNIQUE and PL_BLOB_NOCOPY are specified.

PL_BLOB_NOCOPY also controls whether the release() callback is used – if PL_BLOB_NOCOPY is false, Prolog owns the memory and simply does a free(). This means that if a copyable blob contains some resources, those resources won’t be freed.

Some of this is less of a problem with foreign blobs implemented in C, but become a problem with C++, which supports “copy constructors” and “delete” methods for objects – that is, C can simply use memcpy() and free() but C++ additionally requires calling a copy constructor instead of memcpy() and has additional semantics with its “delete”. (The API allows replacing memcmp() for comparison (compare() callback), has limited support for “delete” (release() callback) and nothing for “copy”.)

Proposals:

  • Add a PL_BLOB_MUTABLE flag – if specified, lookup always uses the blob address. PL_register_blob() should raise an error if both PL_BLOB_MUTABLE and PL_BLOB_UNIQUE are set.
    • An alternative is to change the test for using the address instead of the contents to check only the PL_BLOB_NOCOPY flag and ignore PL_BLOB_UNIQUE.
  • Add copy() and/or move() callback(s) to PL_blob_t to handle the situation where the blob is copyable but memcpy() isn’t suitable.
  • Add an unacquire() callback for the situation where a blob is copyable but contains resources (the inverse of acquire()).

Thanks for the long analysis and, in particular, figuring out why valgrind complains on uninitialized data when creating a blob from a C++ instance as you described in a private mail). I think you made some mistakes in the analysis (but of course, I could also be wrong).

As is, PL_BLOB_UNIQUE says the blob is unique, which means a second lookup returns the existing one. PL_BLOB_NOCOPY indeed says that the memory is owned by foreign code and may be mutable. So, if both are set, we consider the content equal if they point at the same address and the hash is determined by the address. Protected by if ( true(type, PL_BLOB_UNIQUE) ) (line 553), we only search for an existing atom if the PL_BLOB_UNIQUE flag is set.

In your case (PL_BLOB_NOCOPY without PL_BLOB_UNIQUE, we determine the location where we insert the atom using the hash of the content. That is just to balance the hash table. Any (pseudo) random value will do as we add it to the table, but will never do a lookup on it. And indeed, using if ( true(type, PL_BLOB_NOCOPY) ) for line 540 works just fine. Possibly that is better. The choice for hashing on the content was (I think, it is long ago), motivated by assuming the distribution would be better. It is unclear whether that is true or not. Typically the content has more data, but on the other hand, the address is known to be unique and the content may be far too big as well as be always the same. Using the address should fix the valgrind report. What do you think?

No. The release() is always called. See invalidateAtom(). The only difference is that if this hook succeeds, Prolog assumes the data ownen by foreign code (PL_BLOB_NOCOPY) is freed by this hook. Else, Prolog is the owner and thus frees the content.

I’m a little lost here. For C++ objects you obviously want the flags you use (no copy, not unique). Prolog will never inspect the content of the blob except through the hooks associated by the blob type. C++ should not move or delete the memory. I think that should be fine for objects allocated using new(), no? Actually, C++ may delete the memory by calling PL_free_blob() that was added recently to deal with Python objects. This resets the content pointer of the blob to NULL. It does not release the blob (it may be in use) and will call the type hooks normally. These hooks must be aware that the data pointer can be NULL.

Actually, it is not entirely sure you want PL_BLOB_UNIQUE. For PyObject*, Janus uses PL_BLOB_NOCOPY|PL_BLOB_UNIQUE, so if we create a Python object reference from the same Python object, we get the same blob. The acquire() callback is called only the first time and it increments the Python object reference count. The release() callback decrements the reference count (well, if you examine the code you’ll see this is a little harder as one needs to hold the Python GIL to decrement, but if we try to get the GIL we may deadlock, so we collect these objects and decrement their reference whenever we hold the GIL for some other reason). py_free/1 call PL_free_blob(), which calls the release() hook immediately (but it will be called again when AGC destroys the atom). Is that also not more suitable for the C++ wrapper?

So, this seems to work fine. Unless my assumption about the distribution makes sense, I’m ok changing that. I do not see a reason for a PL_BLOB_MUTABLE yet.

? The callbacks act on Prolog’s initiative, i.e., if Prolog wants to do something with the blob (mostly write, compare and release), no? Prolog does not copy or move blobs. If C++ wants to copy the object it may do so. It can then decide to associate the copy with a new blob or not. If C++ wants to move the object and keep it associated to the same blob that could be done by adding something like PL_move_blob(blob, ptr). For non-unique no-copy blobs, that is trivial to implement.

I don’t understand this. Probably my poor understanding of C++ :frowning: Can you explain? The acquire() is actually not needed for non-unique blobs as it is always called and thus you can do the work yourself when creating the blob. For unique blobs it distinguishes an existing blob (no callback) from a new blob (call back). unacquire() is (I think) release(). At least, that was how it was intended. See Python object remarks above.

Before I expose more of my ignorance, let me give my interpretation of “unique” vs “non unique”. (I’ll get to the other items after I re-read the code, and also after confirmation of my understanding of “unique”.)

All atom and blob objects are kept in an “atom table”, and each object is identified by its atom_t value (an unsigned integer). To avoid confusion with other terminology, I’ll call this the “atom_t-value”.

The function lookupAtom() searches for an atom with the same value and returns its atom_t-value. If nothing is found, a new atom is created. Atoms created this way are called “unique” because an equality test can be done using just the atom_t-value.

So, the definition I use for “unique” is: equality can be tested using only the atom_t-value – there is no need to use the compare() callback. This is essentially the same as the Lisp notion of an "intern"ed atom (Python has a similar notion: sys.intern(string)).

A “unique” object must be immutable; otherwise, lookup would be inconsistent; mutable “unique” objects would also allow duplicates in the table).

lookupAtom() calls lookupBlob(), with the flags PL_BLOB_UNIQUE|PL_BLOB_TEXT; lookupBlob() also handles the case of an object (blob) being “not unique”. For “not unique”, there is no search for an entry with the same value; a new atom_t-value is allocated each time.

The lookup uses same_name(), which uses memcmp() – shouldn’t it use the compare() callback if it’s defined?

lookupAtom() also saves the hash value in the atom or blob’s hash_value field (for “unique” atoms, it uses the atom’s content; for “non unique”, the atom’s/blob’s address). But the term_hash/2 predicate (in pl-termhash.c calls primitiveHashValue()) calls (line 96) MurmurHashAligned2() on the hash_value rather than just returning the hash_value – that is, it hashes the hash value. This seems wrong.

1 Like

That IMO depends on how you define equality. It that based on content or on identity? For a normal Prolog atom (text), it is clearly the content and this indeed should be immutable. If a blob represents a Python object, identity means it is the same object, not that the object is immutable. That form of identity is implemented by the PL_BLOB_NO_COPY. Whether or not we should distinguish ownership and mutability is another matter. I’m happy to do so if there is a convincing use case.

If you want some user-defined notion of content based equality, yes. I’m not sure we need that. Surely, it is not trivial to implement as we have no atom (yet) for the new thing to be shared/added. So, we either need a second compare hook or we need to create some fake atom from the new data.

Looks like I was hallucinating. I’ll fix that. Unfortunately the term hash is supposed to be stable. It has changed before in the development series though (and thus eventually in the stable versions).

My working definition of equality is “if two things look the same, they are equal”. But if PL_BLOB_UNIQUE means that equality is defined by identity (which I call “atom_t-value”), then I’ll need to re-think what I wrote.

My naïve definition – for immutable objects – means that using identity (“atom_t-value”) to determine equality is purely an optimisation. For mutable objects, things become more complicated, which is what I was trying to be more precise about.

library(pcre) handles this by caching the regexp string & options (previously: using assert/1; now: using tabling); so I think that object creation and checking for equality can be handled with a light Prolog wrapper and no need to have a version of compare() that can be called before the blob is created.

(pack(rocksdb) allows specifying a database connection using an atom alias, similar to how current_output et al work … this is handled by a thread-safe lookup AtomMap, defined in SWI-cpp2-atommap.h)

In Java and Python, if a class defines an equality function, it also should define a hash function. So, shouldn’t blobs define their own hash function? (Presumably this would require exposing MurmurHashAligned2())

This would only matter if the hash is stored (and used) externally. Do any of the external data formats (e.g., in a .qlf file or using fast_term_serialized/2) use the hash?

PL_BLOB_UNIQUE means do lookup the atom and (thus) a lookupBlob() may either create a new blob or return an existing one. If PL_BLOB_NO_COPY is also present, equality is defined by identity. Else, it is defined by content (and tested simply using memcmp()).

I agree this is a little simple minded. On the other hand, I have trouble finding a good use case where we want something else. In theory we could of course have an application where we want to manage the data ourselves and we want some user defined content based equality. The API isn’t very suitable for this as, after creating your object and trying to do a lookup the caller does not know whether it got a new blob or an existing blob and thus whether or not to delete the new object again. It may be simpler (at least from an API perspective) to find the already existing object by whatever means the foreign code wishes and use pointer-based equality afterwards.

The classical definition (I think from Quintus) defines the hash to be stable, so you can generate files using hashes to relate objects and load these in a new Prolog instance. I doubt it is used much. Although, I recall @dgelessus (?) complaining about the fact that the hash functions of SICStus and SWI-Prolog are different. This hints as relying on them between process instances.

Just to close this off, the C++ implementation of PlBlob uses PL_BLOB_NOCOPY, so the discussion of “hash” and “compare” isn’t relevant. This is because the blob/atom lookup code has been fixed to use the address for the hash rather than the contents – and for a blob, the contents can change, unlike for an atom.