How do I decode bytes to UTF8? My input is base64-encoded UTF8, but base64/2 assumes ASCII, so I end up with
I tried using open_string, but it doesn’t work:
% "foo('├').".encode('utf8') is the binary string b"foo('\xe2\x94\x9c')"
?- open_string("foo('\xe2\x94\x9c').", S),
read_term(S, Term, []), close(S).
S = <stream>(0x56015498e560),
Term = foo('âx94\234\').
No. A string is a sequence of Unicode code points, not a sequence of bytes. How it is represented internally is up to the Prolog system, so you cannot assume there is a char[] inside.
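To make the distinction concrete, here is a small sketch using only the built-ins string_codes/2 and string_length/2. The predicate names bytes_as_text/2 and code_point_text/2 are made up for this illustration. It shows what happens when the three UTF-8 bytes of U+251C are handed over as if each byte were a code point.

```prolog
% Hypothetical helper: (wrongly) treat every byte as one code point.
% For the UTF-8 bytes of U+251C this yields a three-character string,
% which is not the intended single character.
bytes_as_text(Bytes, Text) :-
    string_codes(Text, Bytes).

% Hypothetical helper: build a string from real Unicode code points.
code_point_text(Codes, Text) :-
    string_codes(Text, Codes).

% ?- bytes_as_text([0xE2,0x94,0x9C], T), string_length(T, Len).
%    Len is 3: three code points, one per byte.
% ?- code_point_text([0x251C], T), string_length(T, Len).
%    Len is 1: the single code point U+251C.
```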
Indeed, utf8_codes//1 is the way to go for encoding and decoding short UTF-8 strings, as one frequently has to do in several protocols. Long byte sequences can be handled efficiently by the memory file library. In most cases, though, the I/O functions deal with the encoding and decoding, and issues like these only arise for documents that have multiple encodings inside the same document.
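For the original question, a minimal sketch of the utf8_codes//1 route could look like this. It assumes SWI-Prolog with library(utf8) and library(base64); the predicate names bytes_utf8_text/2 and base64_utf8_text/2 are invented for the example, and it relies on base64/2 returning an atom whose character codes are the raw bytes.

```prolog
:- use_module(library(utf8)).
:- use_module(library(base64)).

% bytes_utf8_text(+Bytes, -Text)
% Bytes is a list of integers 0..255 holding UTF-8 data.
% utf8_codes//1 turns them into Unicode code points.
bytes_utf8_text(Bytes, Text) :-
    phrase(utf8_codes(Codes), Bytes),
    atom_codes(Text, Codes).

% base64_utf8_text(+Encoded, -Text)
% Assumption: base64/2 decodes into an atom with one character
% per byte, so atom_codes/2 recovers the byte list.
base64_utf8_text(Encoded, Text) :-
    base64(Raw, Encoded),
    atom_codes(Raw, Bytes),
    bytes_utf8_text(Bytes, Text).

% ?- bytes_utf8_text([0xE2,0x94,0x9C], T), atom_length(T, Len).
%    Len is 1, and T holds the code point U+251C.
```

The resulting atom can then be passed to term_to_atom/2 or read_term_from_atom/3 if the decoded text is a Prolog term.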
Indeed – by not distinguishing between Unicode code points and bytes, Python 2 got itself all messed up and had to add “byte array” as a primitive type in Python 3 … which broke a lot of code because almost nobody had got it right. utf8_codes//1 is indeed the way to go, but it was hard to find in the manual.