Getting a UTF8 name from a text atom in foreign code

Hi friends,

I’m back again with a fun ffi question.
For terms, we have PL_get_chars and PL_get_nchars, which let you pass in a desired representation for your output, one of those representations being UTF8. This means that no matter how the text is actually stored under the hood (which to my understanding could be either a Latin-1 encoding or a wchar encoding), you get a properly converted bit of text back.

For atoms however, we seem to only have PL_atom_chars (deprecated), PL_atom_nchars and PL_atom_wchars. These do not seem to be capable of returning UTF8 at all. Instead, they just return a pointer to the inner contents, with PL_atom_nchars and PL_atom_wchars only minimally validating that the given atom is indeed a text atom. Notably, neither one makes any attempt to figure out if the inner text is indeed ‘nchar’ (in other words, Latin-1, though I think this could also be any arbitrary octet sequence) or ‘wchar’ (anything else). Since SWI-Prolog doesn’t store strings as UTF8 internally, this means that only the nchar case, for underlying ASCII strings, would return valid UTF8 here (due to the inherent compatibility of ASCII with UTF8).
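To make the Latin-1 versus UTF8 point concrete, here is a minimal sketch in plain C (no SWI-Prolog calls) of what converting a Latin-1 buffer to UTF8 involves. The function name latin1_to_utf8 is made up for this illustration, and it assumes the caller already knows the bytes are Latin-1, which is exactly the information the atom functions do not give you.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert `len` ISO Latin-1 bytes into a freshly
 * malloc'ed, NUL-terminated UTF-8 string. The UTF-8 byte count (without
 * the NUL) is stored in *out_len if out_len is not NULL.
 * Returns NULL on allocation failure; the caller frees the result. */
char *latin1_to_utf8(const unsigned char *in, size_t len, size_t *out_len)
{
    /* Worst case: every byte is >= 0x80 and expands to two bytes. */
    char *out = malloc(2 * len + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: identical in Latin-1 and UTF-8. */
            out[o++] = (char)c;
        } else {
            /* U+0080..U+00FF: two-byte UTF-8 sequence. */
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    if (out_len != NULL)
        *out_len = o;
    return out;
}
```

Only bytes below 0x80 pass through unchanged, which is why a raw pointer to Latin-1 data is valid UTF8 only when the text happens to be pure ASCII.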

Since I can’t actually know from the ffi what the underlying text encoding is for an atom, it seems like my only option to get a UTF8 string out (or indeed, any string in an expected format, rather than just a raw pointer to some data with no encoding info at all) is to

  1. allocate a term
  2. unify this atom with that term
  3. use PL_get_nchars with UTF8 representation set in the flags
  4. deallocate the term

That seems a little crazy, and this can’t actually be done from every context, since term handling requires an active engine, while atom handling only requires an initialized SWI-Prolog environment. Am I missing some function that’d make this easier for me? Or is SWI-Prolog missing this function?

Note that, when creating atoms, there is actually a PL_new_atom_mbchars which takes a rep flag, allowing the creation of atoms from UTF8 strings. Maybe we need a PL_atom_mbchars for retrieval with a similar contract?

Thanks,
Matthijs

I have this code snippet in pcre4pl.c, which uses the undocumented %Ws format specification and where re->pattern is an atom_t:

  PL_STRINGS_MARK();
  Sfprintf(s, "<regex>(%p, /%Ws/)", re, PL_atom_wchars(re->pattern, NULL));
  PL_STRINGS_RELEASE();

The PL_STRINGS_MARK and PL_STRINGS_RELEASE aren’t needed if you’re in a foreign function (this particular code is called to display a “blob”, so it’s outside a foreign function).

From a quick look at the code for Svprintf() in src/os/pl-stream.c, it appears that this also handles embedded nulls (I haven’t tried this to verify).

Thanks for your advice.
I considered using one of the stream writing functions for this (in particular I was looking at Ssnprintf) but that didn’t quite seem to be it either. Ssnprintf is documented to print only Latin-1 for now. Any of the write functions that require a stream would require me to first procure such a stream in a context where I have none, and then extract its written contents. That doesn’t seem easier or better than the term workaround I described.

Why do you need the UTF8 form internally?
Ssnprintf() calls Svfprintf(), so it should support the %Ws format code.

There’s an incantation for creating an in-memory stream and then getting its contents, but I can’t find it right now.

Because that’s what Rust (and loads of other sane languages today actually) uses for all its strings. So if I want to get some data out of swipl, and treat it as a Rust string, I gotta convert to UTF8 somewhere.

I’m sure it supports that code, the problem is on the end where it becomes a nul-terminated string again. Quoth documentation:

Currently buf uses ENC_ISO_LATIN_1 encoding. Future versions will probably change to ENC_UTF8.
(c('Ssnprintf'))

The in-memory stream is an option. But again, if I need to create that thing, it’s not necessarily easier or better than the term workaround, as on every conversion I would need to

  1. create the in-memory stream
  2. call the proper stream write function
  3. extract the raw bytes out of the in-memory stream
  4. free the stream

That’s just as much of a workaround as the one with the temporary term, with no obvious benefit over the term workaround.

(There’s some Stack Overflow code for converting from wchar_t to UTF8: c++ - UTF8 to/from wide char conversion in STL - Stack Overflow, and probably something similar inside the SWI-Prolog source code.)
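For reference, a hand-rolled wchar_t to UTF8 conversion is short enough to sketch in plain C. This is not the SWI-Prolog or Stack Overflow code, just an illustration; the name wchars_to_utf8 is made up, and it assumes the input is either UTF-32 code points (32-bit wchar_t) or UTF-16 code units (16-bit wchar_t), with anything invalid replaced by U+FFFD.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: encode `len` wide characters as a malloc'ed,
 * NUL-terminated UTF-8 string. Embedded zero characters are encoded as
 * a zero byte, so use *out_len (if not NULL) rather than strlen().
 * Returns NULL on allocation failure; the caller frees the result. */
char *wchars_to_utf8(const wchar_t *in, size_t len, size_t *out_len)
{
    /* Each input unit produces at most 4 output bytes. */
    char *out = malloc(4 * len + 1);
    size_t o = 0;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned long cp = (unsigned long)in[i];

        /* Combine a UTF-16 surrogate pair into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
            unsigned long lo = (unsigned long)in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }
        /* Lone surrogates and out-of-range values become U+FFFD. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = 0xFFFD;

        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    if (out_len != NULL)
        *out_len = o;
    return out;
}
```

Something like this could in principle be fed the pointer and length from PL_atom_wchars, though that is an untested assumption here; the returned bytes plus length would then map onto a Rust string on the other side.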

This code might help you:

and it’s used:

I just noticed that there’s also a U format code for “UTF-8 string”, so perhaps “%Us” (for wchars) or “%Uc” will do what you want?