Getting an UTF8 name from a text atom in foreign code

maren · July 18, 2022, 3:19pm

Hi friends,

I’m back again with a fun ffi question.
For terms, we have PL_get_chars and PL_get_nchars, which let you pass in a desired representation for your output, one of those representations being UTF8. This means that no matter how the text is actually stored under the hood (which, to my understanding, could be either a latin-1 encoding or a wchar encoding), you get properly converted text back.

For atoms, however, we seem to only have PL_atom_chars (deprecated), PL_atom_nchars and PL_atom_wchars. These do not seem to be capable of returning UTF8 at all. Instead, they just return a pointer to the inner contents, with PL_atom_nchars and PL_atom_wchars only minimally validating that the given atom is indeed a text atom. Notably, neither one makes any attempt to figure out whether the inner text is indeed ‘nchar’ (in other words, latin-1, though I think this could also be any arbitrary octet sequence) or ‘wchar’ (anything else). Since SWI-Prolog doesn’t store strings as UTF8 internally, this means that only the nchar case, and then only when the underlying text is pure ASCII, would return valid UTF8 here (due to the inherent compatibility of ASCII with UTF8).
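To make the ASCII point concrete, here is a minimal sketch of what a latin-1 to UTF8 conversion has to do. It assumes the caller already knows the bytes are latin-1, which is exactly the information the atom functions don't expose. latin1_to_utf8 is a hypothetical helper for illustration, not part of the SWI-Prolog API.

```c
#include <stdlib.h>

/* Hypothetical helper (NOT a SWI-Prolog function): convert `len` bytes of
 * ISO Latin-1 text into a malloc'ed, NUL-terminated UTF-8 string.
 * Works on an explicit length, so embedded NUL bytes are copied through.
 * The caller frees the result; returns NULL if allocation fails. */
static char *
latin1_to_utf8(const char *s, size_t len, size_t *out_len)
{ char *out = malloc(2*len + 1);      /* worst case: every byte becomes two */
  size_t o = 0;

  if ( !out )
    return NULL;

  for (size_t i = 0; i < len; i++)
  { unsigned char c = (unsigned char)s[i];

    if ( c < 0x80 )                   /* ASCII: same single byte in UTF-8 */
    { out[o++] = (char)c;
    } else                            /* U+0080..U+00FF: two-byte sequence */
    { out[o++] = (char)(0xC0 | (c >> 6));
      out[o++] = (char)(0x80 | (c & 0x3F));
    }
  }

  out[o] = '\0';
  if ( out_len )
    *out_len = o;
  return out;
}
```

For example, the four latin-1 bytes of "café" (63 61 66 E9) come out as five UTF8 bytes (63 61 66 C3 A9). Only the first three, the ASCII ones, pass through unchanged.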

Since I can’t actually know from the ffi what the underlying text encoding is for an atom, it seems like my only option to get an UTF8 string out (or indeed, any string in an expected format, rather than just a raw pointer to some data with no encoding info at all) is to:

1. allocate a term
2. unify this atom with that term
3. use PL_get_nchars with UTF8 representation set in the flags
4. deallocate the term

That seems a little crazy, and it can’t actually be done from every context, since term handling requires an active engine, while atom handling only requires an initialized SWI-Prolog environment. Am I missing some function that would make this easier for me? Or is SWI-Prolog missing this function?

Note that, when creating atoms, there is actually a PL_new_atom_mbchars which takes a rep flag, allowing the creation of atoms from UTF8 strings. Maybe we need a PL_atom_mbchars for retrieval with a similar contract?

Thanks,
Matthijs

peter.ludemann · July 18, 2022, 4:05pm

I have this code snippet in pcre4pl.c, which uses the undocumented %Ws format specification and where re->pattern is an atom_t:

  PL_STRINGS_MARK();
  Sfprintf(s, "<regex>(%p, /%Ws/)", re, PL_atom_wchars(re->pattern, NULL));
  PL_STRINGS_RELEASE();

The PL_STRINGS_MARK and PL_STRINGS_RELEASE aren’t needed if you’re in a foreign function (this particular code is called to display a “blob”, so it’s outside a foreign function).

From a quick look at the code for Svprintf() in src/os/pl-stream.c, it appears that this also handles embedded nulls (I haven’t tried this to verify).

maren · July 18, 2022, 4:14pm

Thanks for your advice.
I considered using one of the stream writing functions for this (in particular I was looking at Ssnprintf) but that didn’t quite seem to be it either. Ssnprintf is documented to print only latin-1 for now. Any of the write functions that requires a stream would require me to first procure such a stream in a context where I have none, and then extract its written contents. That doesn’t seem easier or better than the term workaround I described.

peter.ludemann · July 18, 2022, 4:31pm

Why do you need the UTF8 form internally?
Ssnprintf() calls Svfprintf(), so it should support the %Ws format code.

There’s an incantation for creating an in-memory stream and then getting its contents, but I can’t find it right now.

maren · July 18, 2022, 4:39pm

Cause that’s what Rust (and loads of other sane languages today actually) uses for all its strings. So if I want to get some data out of swipl, and treat it as a rust string, I gotta convert to UTF8 somewhere.

I’m sure it supports that code, the problem is on the end where it becomes a nul-terminated string again. Quoth documentation:

Currently buf uses ENC_ISO_LATIN_1 encoding. Future versions will probably change to ENC_UTF8.
(from the documentation of Ssnprintf())

The in-memory stream is an option. But again, if I need to create that thing, it’s not necessarily easier or better than the term workaround, as on every conversion I would need to:

1. create the in-memory stream
2. call the proper stream write function
3. extract the raw bytes out of the in-memory stream
4. free the stream

That’s just as much of a workaround as the one with the temporary term, with no obvious benefit over the term workaround.

peter.ludemann · July 18, 2022, 4:53pm

(There’s some Stack Overflow code for converting from wchar_t to UTF8: “c++ - UTF8 to/from wide char conversion in STL - Stack Overflow”, and probably something similar inside the SWI-Prolog source code.)
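For the wide-character case, a hand-rolled conversion might look roughly like the sketch below. It assumes the input is a pointer plus a length, such as the pair PL_atom_wchars() yields in the snippet quoted in this thread, and that the atom really is stored as wide characters. wchars_to_utf8 is a hypothetical name, not an existing SWI-Prolog function, and the sketch does not validate lone surrogates or out-of-range code points.

```c
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (NOT a SWI-Prolog function): encode `len` wide
 * characters as a malloc'ed, NUL-terminated UTF-8 string.
 * The caller frees the result; returns NULL if allocation fails. */
static char *
wchars_to_utf8(const wchar_t *ws, size_t len, size_t *out_len)
{ char *out = malloc(4*len + 1);        /* at most 4 bytes per input unit */
  size_t o = 0;

  if ( !out )
    return NULL;

  for (size_t i = 0; i < len; i++)
  { uint32_t c = (uint32_t)ws[i];

#if WCHAR_MAX <= 0xFFFF                 /* 16-bit wchar_t: join surrogate pairs */
    if ( c >= 0xD800 && c <= 0xDBFF && i+1 < len )
    { uint32_t lo = (uint32_t)ws[i+1];

      if ( lo >= 0xDC00 && lo <= 0xDFFF )
      { c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        i++;                            /* consumed the low surrogate too */
      }
    }
#endif

    if ( c < 0x80 )                     /* 1 byte: ASCII */
    { out[o++] = (char)c;
    } else if ( c < 0x800 )             /* 2 bytes */
    { out[o++] = (char)(0xC0 | (c >> 6));
      out[o++] = (char)(0x80 | (c & 0x3F));
    } else if ( c < 0x10000 )           /* 3 bytes */
    { out[o++] = (char)(0xE0 | (c >> 12));
      out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
      out[o++] = (char)(0x80 | (c & 0x3F));
    } else                              /* 4 bytes */
    { out[o++] = (char)(0xF0 | (c >> 18));
      out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
      out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
      out[o++] = (char)(0x80 | (c & 0x3F));
    }
  }

  out[o] = '\0';
  if ( out_len )
    *out_len = o;
  return out;
}
```

Because it takes an explicit length rather than relying on a terminator, embedded nulls are carried through as well. The returned buffer is plain malloc memory, so a Rust caller could copy it into a String and free it right away.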

This code might help you:

github.com

SWI-Prolog/packages-cpp/blob/699be0c5c2b2da7586db20e18e0ff265703c6444/ffi4pl.c#L131

      }
      PL_retry_address(ctxt); /* Succeed with a choice point */
    }
  }

  static char* range_ffi_str;
  #define RANGE_FFI_STR_LEN 100
  #define RANGE_FFI_STR_CONTENTS "RANGE_FFI"

  #include <stdio.h> /* TODO: remove when fwprintf() is removed */
  static foreign_t
  w_atom_ffi_(term_t stream, term_t t)
  { IOSTREAM* s;
    atom_t a;
    if ( !PL_get_stream(stream, &s, SIO_INPUT) ||
         !PL_get_atom_ex(t, &a) )
      return FALSE;
    PL_STRINGS_MARK();
    { size_t len;
      const pl_wchar_t *sa = PL_atom_wchars(a, &len);
      Sfprintf(s, "/%Ws/%zd", sa, len);
and it’s used:

github.com

SWI-Prolog/packages-cpp/blob/699be0c5c2b2da7586db20e18e0ff265703c6444/test_ffi.pl#L107

      range_ffialloc(1, 2, X).

  :- end_tests(ffi).

  :- begin_tests(wchar).

  % The following "wchar" tests are regression tests related
  % to https://github.com/SWI-Prolog/packages-pcre/issues/20

  test(wchar_1, all(Result == ["//0", "/ /1", "/abC/3", "/Hello World!/12", "/хелло/5", "/хелло 世界/8", "/網目錦へび [àmímé níshíkíhéꜜbì]/26"])) :-
      (   w_atom_ffi('',             Result)
      ;   w_atom_ffi(' ',            Result)
      ;   w_atom_ffi('abC',          Result)
      ;   w_atom_ffi('Hello World!', Result)
      ;   w_atom_ffi('хелло',        Result)
      ;   w_atom_ffi('хелло 世界',   Result)
      ;   w_atom_ffi('網目錦へび [àmímé níshíkíhéꜜbì]', Result)
      ).

  test(wchar_2,

peter.ludemann · July 18, 2022, 8:44pm

I just noticed that there’s also a U format code for “UTF-8 string”, so perhaps “%Us” (for wchars) or “%Uc” will do what you want?
