Write to socket -- does it strip quotes?

UTF-8 vs Unicode code points is a different problem – and it’s not trivial either. Python 2 messed it up so badly that a non-compatible change was made with Python 3, which probably set back Python development by half a decade; C++ is unlikely to ever have anything better than “be very careful”; and you have to be more disciplined/careful than me to not make a mistake (quick: should I be using “encode” or “decode”, or is my byte string already encoded/decoded?).
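To make the “encode or decode?” question concrete, here is a minimal Python 3 sketch. The sample word is my own choice, not from the thread: `encode` goes from text (code points) to bytes, and `decode` goes from bytes back to text.

```python
# str holds Unicode code points; bytes holds an encoding of them.
text = "na\u00efve"              # 5 code points, one of them non-ASCII (U+00EF)
data = text.encode("utf-8")      # text -> bytes
round_trip = data.decode("utf-8")  # bytes -> text

print(len(text), len(data))      # 5 6: U+00EF takes two bytes in UTF-8
print(round_trip == text)        # True
```

In Python 3, calling `decode` on a `str` or `encode` on `bytes` raises `AttributeError`, which catches this class of mistake early.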

There’s a 1-1 mapping between UTF-8 and 32-bit Unicode. (I think …)
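A quick way to check that claim on a few samples, as a sketch only: the helper name `utf8_to_utf32_units` is made up for illustration, and surrogate values (U+D800 to U+DFFF) are excluded because they are not valid in either encoding.

```python
import struct

def utf8_to_utf32_units(data):
    """Decode UTF-8 bytes and return the 32-bit units of the UTF-32 form."""
    text = data.decode("utf-8")
    raw = text.encode("utf-32-le")   # explicit endianness, so no BOM is added
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

sample = "a\u00e9\u20ac\U0001F600"   # 1-, 2-, 3- and 4-byte UTF-8 sequences
units = utf8_to_utf32_units(sample.encode("utf-8"))

# Each 32-bit unit is exactly one code point, and the trip back is lossless.
back = "".join(chr(u) for u in units).encode("utf-8")
print([hex(u) for u in units])       # ['0x61', '0xe9', '0x20ac', '0x1f600']
print(back == sample.encode("utf-8"))  # True
```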

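The “İzmir” example comes from how Unicode handles the Turkish dotted capital “İ” (U+0130): lowercasing it yields “i” followed by a combining dot above (U+0307), so one character becomes two code points. Here’s more explanation: Issue 34723: lower() on Turkish letter "İ" returns a 2-chars-long string (Python tracker) and Issue 12846: unicodedata.normalize turkish letter problem (Python tracker).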
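The behaviour those two issues describe can be reproduced in a few lines of Python 3. This is just an illustration; the variable names are mine.

```python
import unicodedata

s = "\u0130zmir"                 # "İzmir" with U+0130, capital I with dot above
lowered = s.lower()

print(len(s), len(lowered))      # 5 6
print([hex(ord(c)) for c in lowered])
# ['0x69', '0x307', '0x7a', '0x6d', '0x69', '0x72']

# Canonical decomposition (NFD) splits U+0130 into "I" plus a combining dot.
decomposed = unicodedata.normalize("NFD", "\u0130")
print([hex(ord(c)) for c in decomposed])   # ['0x49', '0x307']
```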

Depends on the encoding. With UTF-8, if you look at the raw values and one of them is greater than 255, the data is un-encoded (it need not be decoded), meaning the values are Unicode code points. If all of the values are 255 or less, then the data is probably encoded, but if the original data was just 7-bit ASCII, then it doesn’t matter.
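Here is a rough Python sketch of that heuristic; the function name and sample values are mine. It also shows where the heuristic cannot decide: a list whose values are all 255 or less may be UTF-8 bytes or may be code points that happen to be small.

```python
def has_value_above_255(codes):
    """True means the list cannot be raw bytes, so it must be code points."""
    return any(c > 255 for c in codes)

euro_codes = [ord(c) for c in "\u20ac"]        # [8364]: clearly a code point
euro_bytes = list("\u20ac".encode("utf-8"))    # [226, 130, 172]: UTF-8 bytes

print(has_value_above_255(euro_codes))   # True
print(has_value_above_255(euro_bytes))   # False

# The ambiguous case: [195, 169] reads fine both ways.
ambiguous = [195, 169]
as_utf8 = bytes(ambiguous).decode("utf-8")           # one char, U+00E9
as_codes = "".join(chr(c) for c in ambiguous)        # two chars, U+00C3 U+00A9
print(len(as_utf8), len(as_codes))       # 1 2
```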

I have been staring at the raw bytes for days and am currently writing a hex dump in Prolog because I am getting tired of running code arrays through string_codes/2 by hand. When the values are hex it is easier for me to identify the data, but when it is all decimal characters I need to use a lookup table.

It’s more complicated than that, but I don’t want to get into explaining all the nuances, and others have done it better than me anyway. Trust me: I’ve been badly bitten by subtle mistakes in encoding/decoding – e.g., getting confused about whether an offset is in bytes (for encoded UTF-8 data) or in characters (for decoded Unicode data). And if your test data happens to be all ASCII, you won’t find such mistakes (you won’t see any values greater than 127).
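A small Python illustration of the offset trap, using sample strings of my own: the same substring sits at different offsets depending on whether you index the decoded text or the encoded bytes, and the difference vanishes for pure ASCII.

```python
text = "caf\u00e9 au lait"           # one non-ASCII character before "au"
data = text.encode("utf-8")

char_offset = text.index("au")       # offset in characters (code points)
byte_offset = data.index(b"au")      # offset in bytes
print(char_offset, byte_offset)      # 5 6

# Converting a character offset to a byte offset: encode the prefix.
converted = len(text[:char_offset].encode("utf-8"))
print(converted == byte_offset)      # True

# With all-ASCII test data the two offsets agree, hiding the bug.
ascii_text = "cafe au lait"
print(ascii_text.index("au") == ascii_text.encode("utf-8").index(b"au"))  # True
```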

The Python documentation is a pretty good introduction, if you substitute Prolog’s list-of-codes for Python’s strings (and you can also use SWI-Prolog’s strings the same way as Python strings).

Are you talking about converting UTF-8 to UTF-16 or UTF-32?
UTF-8 → UTF-16
UTF-8 → UTF-32

I am talking about converting UTF-8, or UTF-16 to Unicode Code Points.

UTF-8 → Unicode Code Points
UTF-16 → Unicode Code Points
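For comparison, this is roughly what those two conversions look like in Python 3 (a sketch; `code_points` is a made-up helper name). The Prolog analogue would be a list of codes.

```python
def code_points(data, encoding):
    """Decode bytes in the given encoding and return a list of code points."""
    return [ord(ch) for ch in data.decode(encoding)]

utf8_bytes = b"\xf0\x9f\x98\x80"     # U+1F600 as four UTF-8 bytes
utf16_bytes = b"\x3d\xd8\x00\xde"    # U+1F600 as a UTF-16LE surrogate pair

print(code_points(utf8_bytes, "utf-8"))       # [128512]
print(code_points(utf16_bytes, "utf-16-le"))  # [128512]
```

Both inputs decode to the single code point 0x1F600, even though the UTF-8 form is four 8-bit units and the UTF-16 form is two 16-bit units.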