I’m back again with a fun ffi question.
For terms, we have PL_get_chars and PL_get_nchars, which let you pass in a desired representation for your output, one of those representations being UTF8. This means that no matter how the text is actually stored under the hood (which to my understanding could be either a Latin-1 encoding or a wchar encoding), you get properly converted text back.
For atoms, however, we seem to have only PL_atom_chars (deprecated), PL_atom_nchars and PL_atom_wchars. None of these seems capable of returning UTF8. Instead, they just return a pointer to the inner contents, with PL_atom_nchars and PL_atom_wchars only minimally validating that the given atom is indeed a text atom. Notably, neither makes any attempt to figure out whether the inner text is actually ‘nchar’ (in other words, Latin-1, though I think this could also be any arbitrary octet sequence) or ‘wchar’ (anything else). Since SWI-Prolog doesn’t store strings as UTF8 internally, only the nchar case with an underlying pure-ASCII string would return valid UTF8 here (because ASCII is a subset of UTF8).
Since I can’t actually know from the ffi what the underlying text encoding is for an atom, it seems like my only option to get a UTF8 string out (or indeed, any string in an expected format, rather than just a raw pointer to some data with no encoding info at all) is to:
1. allocate a term
2. unify this atom with that term
3. use PL_get_nchars with UTF8 representation set in the flags
4. deallocate the term
That seems a little crazy, and it can’t actually be done from every context, since term handling requires an active engine, while atom handling only requires an initialized SWI-Prolog environment. Am I missing some function that’d make this easier for me? Or is SWI-Prolog missing this function?
Note that, when creating atoms, there is actually a PL_new_atom_mbchars which takes a rep flag, allowing the creation of atoms from UTF8 strings. Maybe we need a PL_atom_mbchars for retrieval with a similar contract?
Thanks for your advice.
I considered using one of the stream writing functions for this (in particular I was looking at Ssnprintf), but that didn’t quite seem to be it either. Ssnprintf is documented to print only Latin-1 for now. Any of the write functions that require a stream would require me to first procure such a stream in a context where I have none, and then extract its written contents. That doesn’t seem easier or better than the term workaround I described.
Because that’s what Rust (and loads of other sane languages today, actually) uses for all its strings. So if I want to get some data out of swipl and treat it as a Rust string, I gotta convert to UTF8 somewhere.
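To make concrete what "convert to UTF8 somewhere" would mean on the Rust side, here's a minimal sketch. It assumes I somehow already knew whether the raw data was Latin-1 or wide characters, which is exactly the piece the atom API doesn't tell me. Both helper names are made up for illustration, and the wide case assumes a 32-bit wchar_t as on most Unix-like platforms; on Windows wchar_t is 16-bit and would need UTF-16 decoding instead.

```rust
/// Hypothetical helper: decode ISO Latin-1 bytes into a Rust (UTF-8) String.
/// Every Latin-1 byte value equals its Unicode code point (U+0000..=U+00FF),
/// so casting each byte to `char` is a complete decoder.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: decode 32-bit wide characters into a String.
/// Assumption: each unit is one Unicode code point (32-bit wchar_t).
/// Returns None if any unit is not a valid Unicode scalar value
/// (for example a surrogate or a value above U+10FFFF).
fn wide_to_string(units: &[u32]) -> Option<String> {
    units.iter().map(|&u| char::from_u32(u)).collect()
}
```

With these, `latin1_to_string(b"caf\xe9")` gives `"café"`, whereas handing those same four bytes straight to `std::str::from_utf8` fails, since a lone 0xE9 byte is not valid UTF8. Pure-ASCII input passes through both paths unchanged, which is why only that case happens to work with the raw pointer.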
I’m sure it supports that code; the problem is on the end where it becomes a nul-terminated string again. Quoting the documentation:
Currently buf uses ENC_ISO_LATIN_1 encoding. Future versions will probably change to ENC_UTF8.
The in-memory stream is an option. But again, if I need to create that thing, it’s not necessarily easier or better than the term workaround, as on every conversion I would need to:
1. create the in-memory stream
2. call the proper stream write function
3. extract the raw bytes out of the in-memory stream
4. free the stream
That’s just as much of a workaround as the temporary term, with no obvious benefit over it.