Write to socket -- does it strip quotes?

It depends on the encoding. With UTF-8, look at the raw values: if any value is greater than 255, the data is already decoded (it need not be decoded again), meaning the values are Unicode code points. If all of the values are 255 or less, then the data is probably still encoded, but if the original data was just 7-bit ASCII, then it doesn’t matter.

I have been staring at the raw bytes for days and am currently writing a hex dump in Prolog because I am getting tired of running code lists through string_codes/2 by hand. When the values are in hex it is easier for me to identify the data, but when they are all in decimal I need to use a lookup table.
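For anyone who wants to see what such a dump looks like, here is a minimal sketch in Python (not the Prolog version mentioned above). The function name `hex_dump`, the `width` parameter, and the sample string are made up for illustration.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an offset / hex / printable-ASCII dump of data (illustrative helper)."""
    pad = width * 3 - 1  # width of the hex column: two digits per byte plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is; everything else as '.'
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# "h\u00e9llo" is "héllo"; the é becomes the two bytes c3 a9 in UTF-8.
sample = "h\u00e9llo".encode("utf-8")
print(hex_dump(sample))
```

In the hex column the two-byte sequence `c3 a9` stands out right away, which is much harder to spot when the same bytes are printed as the decimal values 195 and 169.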

It’s more complicated than that, but I don’t want to get into explaining all the nuances, and others have done it better than me anyway. Trust me: I’ve been badly bitten by subtle mistakes in encoding/decoding, for example getting confused about whether an offset is in bytes (for encoded (UTF-8) data) or in characters (for decoded (Unicode) data). And if your test data happens to be all ASCII, you won’t find such mistakes (you won’t see any values greater than 127).

The Python documentation is a pretty good introduction, if you substitute Prolog’s list-of-codes for Python’s strings (and you can also use SWI-Prolog’s strings the same way as Python strings).
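The byte-offset versus character-offset confusion described above can be sketched in Python. This is only an illustration with made-up sample strings, not code from any real project.

```python
text = "na\u00efve caf\u00e9"      # "naïve café", decoded: a sequence of code points
data = text.encode("utf-8")        # encoded: a sequence of bytes

needle = "caf\u00e9"
char_offset = text.index(needle)                  # counted in characters
byte_offset = data.index(needle.encode("utf-8"))  # counted in bytes

print(char_offset, byte_offset)    # 6 7: the ï before the match takes two bytes

# Using the character offset on the byte string lands one byte too early.
print(data[char_offset:].decode("utf-8"))   # " café" (leading space)
print(data[byte_offset:].decode("utf-8"))   # "café"

# With all-ASCII test data the two offsets coincide, so the bug stays hidden.
ascii_text = "naive cafe"
ascii_data = ascii_text.encode("utf-8")
ascii_char_offset = ascii_text.index("cafe")
ascii_byte_offset = ascii_data.index(b"cafe")
print(ascii_char_offset, ascii_byte_offset)  # 6 6
```

The same distinction should apply in Prolog between a list of byte values read from a binary stream and a list of codes obtained after decoding.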

Are you talking about converting UTF-8 to UTF-16 or UTF-32?
UTF-8 → UTF-16
UTF-8 → UTF-32

I am talking about converting UTF-8 or UTF-16 to Unicode code points.

UTF-8 → Unicode Code Points
UTF-16 → Unicode Code Points
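As a rough sketch of those two conversions, here is how they look in Python, where a decoded string plays the role of a Prolog code list. The helper name `code_points` and the sample text are hypothetical, and little-endian UTF-16 without a byte order mark is assumed.

```python
def code_points(raw: bytes, encoding: str) -> list:
    """Decode raw bytes and return the Unicode code points as integers."""
    return [ord(ch) for ch in raw.decode(encoding)]


# A, é, €, and an emoji: 1, 2, 3, and 4 bytes respectively in UTF-8.
text = "A\u00e9\u20ac\U0001F600"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")   # assumption: little-endian, no BOM

from_utf8 = code_points(utf8_bytes, "utf-8")
from_utf16 = code_points(utf16_bytes, "utf-16-le")

print([hex(b) for b in utf8_bytes])    # raw bytes, every value at most 0xff
print([hex(cp) for cp in from_utf8])   # code points, some greater than 0xff
print(from_utf8 == from_utf16)         # both encodings decode to the same code points
```

Both byte sequences differ, but once decoded they give the same list of code points, which is the form you would normally want to work with.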