How to decode bytes as UTF8?

How do I decode bytes to UTF8? My input is base64-encoded UTF8, but base64/2 assumes ASCII, so I end up with

I tried using open_string, but it doesn’t work:

% In Python, "foo('├').".encode('utf8') is the byte string b"foo('\xe2\x94\x9c')."
?- open_string("foo('\xe2\x94\x9c').", S), 
   read_term(S, Term, []), close(S).
S = <stream>(0x56015498e560),
Term = foo('â\224\\234\').

So, I tried setting the encoding, but:

?- open_string("foo('\xe2\x94\x9c').", S), 
   set_stream(S, encoding(utf8)),  
   read_term(S, Term, []), 
   close(S).
ERROR: No permission to encoding stream `<stream>(0x5601549972d0)'
ERROR: In:
ERROR:   [11] set_stream(<stream>(0x5601549972d0),encoding(utf8))
ERROR:   [10] '<meta-call>'(user:(...,...)) <foreign>
ERROR:    [9] <user>

Converting between UTF-8 bytes and Unicode strings is trivial in Python, e.g.:

>>> b"foo('\xe2\x94\x9c').".decode('utf8')  # convert bytes to unicode string
"foo('├')."
>>> "foo('├').".encode('utf8')  # convert unicode string to bytes
b"foo('\xe2\x94\x9c')."

Of course, right after posting the question, I figured it out:

?- Z = 'foo(\'â\224\\234\\').', 
   atom_codes(Z, Codes),  
   phrase(utf8_codes(StrCodes), Codes), 
   atom_codes(Str, StrCodes),
   term_string(Term, Str).
Z = 'foo(\'â\224\\234\\').',
Codes = [102, 111, 111, 40, 39, 226, 148, 156, 39|...],
StrCodes = [102, 111, 111, 40, 39, 9500, 39, 41, 46],
Str = 'foo(\'├\').',
Term = foo(├).

Still, it would be nice if I could set the encoding on a string stream.
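For anyone landing here with the same base64 problem, the steps above can be wrapped up roughly like this. It is only a sketch: the predicate names utf8_bytes_text/2 and base64_utf8_term/2 are made up for illustration, and it assumes base64/2 hands back the decoded data as an atom with one character per byte (codes 0..255).

```prolog
:- use_module(library(base64)).
:- use_module(library(utf8)).

%!  utf8_bytes_text(+Bytes, -Text) is semidet.
%
%   Hypothetical helper. Bytes is an atom or string in which every
%   character code (0..255) stands for one byte of UTF-8 data.
%   Text is the decoded Unicode string.
utf8_bytes_text(Bytes, Text) :-
    atom_codes(Bytes, ByteCodes),
    phrase(utf8_codes(Codes), ByteCodes),   % decode UTF-8 bytes to code points
    string_codes(Text, Codes).

%!  base64_utf8_term(+Encoded, -Term) is semidet.
%
%   Hypothetical helper. Decode base64 text whose payload is a
%   UTF-8 encoded Prolog term and parse that term.
base64_utf8_term(Encoded, Term) :-
    base64(Bytes, Encoded),                 % assumed: one character per byte
    utf8_bytes_text(Bytes, Text),
    term_string(Term, Text).
```

The string "Zm9vKCfilJwnKS4=" is the base64 encoding of the bytes from the Python example, so it makes a convenient test input:

```prolog
?- base64_utf8_term("Zm9vKCfilJwnKS4=", Term).
```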

When you get it working as you like, perhaps you might consider adding a topic in the Nice to know category.

No. A string is a sequence of Unicode code points, not a sequence of bytes. How this is represented is up to Prolog and (thus) you cannot assume this to be a char[] inside.
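To make the distinction concrete, here is a small sketch that goes in the encoding direction with utf8_codes//1. The name text_utf8_bytes/2 is hypothetical and used only for illustration.

```prolog
:- use_module(library(utf8)).

%!  text_utf8_bytes(+Text, -Bytes) is det.
%
%   Hypothetical helper. Bytes is the list of UTF-8 byte values
%   that encode the code points of Text.
text_utf8_bytes(Text, Bytes) :-
    atom_codes(Text, Codes),              % code points, not bytes
    phrase(utf8_codes(Codes), Bytes).     % encode code points as UTF-8 bytes
```

The atom '├' is a single code point (9500), so atom_length/2 reports 1, while its UTF-8 encoding takes three bytes:

```prolog
?- atom_length('├', Len), text_utf8_bytes('├', Bytes).
Len = 1,
Bytes = [226, 148, 156].
```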

Indeed, utf8_codes//1 is the way to go for en/decoding short UTF-8 strings, as one frequently has to do in several protocols. Long byte sequences can be handled efficiently by the memory file library. In most cases though, the I/O functions deal with the en/decoding, and issues like these only arise for documents that have multiple encodings inside the same document.
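One possible reading of the memory file suggestion, as a sketch rather than a recommended implementation: write the raw bytes to a memory file opened with encoding(octet), then read the content back as UTF-8. The predicate name utf8_bytes_string/2 is hypothetical, and the sketch assumes the default (loose) stream type checking so that put_byte/2 is accepted on the memory file stream.

```prolog
:- use_module(library(memfile)).
:- use_module(library(apply)).

%!  utf8_bytes_string(+Bytes, -String) is det.
%
%   Hypothetical helper. Bytes is a list of integers 0..255 holding
%   UTF-8 data. String is the decoded Unicode string.
utf8_bytes_string(Bytes, String) :-
    setup_call_cleanup(
        new_memory_file(MF),
        (   setup_call_cleanup(
                open_memory_file(MF, write, Out, [encoding(octet)]),
                maplist(put_byte(Out), Bytes),   % store the bytes unchanged
                close(Out)),
            % reinterpret the stored bytes as UTF-8 text
            memory_file_to_string(MF, String, utf8)
        ),
        free_memory_file(MF)).
```

For short inputs this gives the same result as utf8_codes//1; the memory file avoids building an intermediate code list in the decoding step.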


Indeed – by not distinguishing between Unicode code points and bytes, Python 2 got itself all messed up and had to add “byte array” as a primitive type in Python 3 … which broke a lot of code because almost nobody had got it right. utf8_codes//1 is indeed the way to go, but it was hard to find in the manual.