Write to socket -- does it strip quotes?

peter.ludemann · February 24, 2020, 5:53pm

UTF-8 vs. Unicode code points is a different problem, and it’s not trivial either. Python 2 messed it up so badly that a non-compatible change was made with Python 3, which probably set back Python development by half a decade. C++ is unlikely to ever have anything better than “be very careful”, and you have to be more disciplined and careful than me to not make a mistake (quick: should I be using “encode” or “decode”, or is my byte string already encoded or decoded?).

There’s a 1-1 mapping between UTF-8 and 32-bit Unicode. (I think …)
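The 1-1 claim can be checked empirically. The following is a minimal Python 3 sketch, not anything from this thread: the helper names `utf8_round_trips` and `is_surrogate` are made up for illustration. It assumes "32-bit Unicode" means code points 0 through 0x10FFFF, and it skips the surrogate range 0xD800-0xDFFF, which well-formed UTF-8 does not encode.

```python
def utf8_round_trips(code_point):
    """Return True if UTF-8 encoding followed by decoding gives back the code point."""
    ch = chr(code_point)
    return ord(ch.encode("utf-8").decode("utf-8")) == code_point


def is_surrogate(code_point):
    """Surrogates are reserved for UTF-16 and are not valid in UTF-8."""
    return 0xD800 <= code_point <= 0xDFFF


# Check every code point outside the surrogate range.
all_round_trip = all(
    utf8_round_trips(cp) for cp in range(0x110000) if not is_surrogate(cp)
)
print(all_round_trip)

# A lone surrogate is rejected by Python's strict UTF-8 encoder.
try:
    chr(0xD800).encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False
print(surrogate_encodes)
```

So the mapping holds in both directions for well-formed data, with the surrogate range as the exception to keep in mind.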

The “İzmir” example arises because lowercasing “İ” (U+0130) produces “i” followed by a combining dot above (U+0307), so one character becomes two code points. Here’s more explanation: Issue 34723: lower() on Turkish letter "İ" returns a 2-chars-long string - Python tracker and Issue 12846: unicodedata.normalize turkish letter problem - Python tracker
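For anyone who wants to see the behavior those two issues describe, here is a small Python 3 sketch. It only uses `str.lower` and the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

capital = "\u0130"          # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = capital.lower()   # one code point in, two code points out

print(len(capital), len(lowered))
for ch in lowered:
    # Show each resulting code point with its Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The lowered string is a plain “i” followed by U+0307 COMBINING DOT ABOVE, which is why any code that assumes lowercasing preserves length can go wrong here.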

EricGT · February 24, 2020, 6:19pm

Depends on the encoding. With UTF-8, if you look at the raw values and any value is greater than 255, then the data is already decoded (it need not be decoded), meaning the values are Unicode code points. If all of the values are 255 or less, then the data is probably still encoded, but if the original data was just 7-bit ASCII, then it doesn’t matter.

I have been staring at the raw bytes for days and am currently writing a hex dump in Prolog because I am getting tired of running code arrays through string_codes/2 by hand. When the values are hex it is easier for me to identify the data, but when it is all decimal characters I need to use a lookup table.

peter.ludemann · February 24, 2020, 6:42pm

It’s more complicated than that, but I don’t want to get into explaining all the nuances, and others have done it better than me anyway. Trust me: I’ve been badly bitten by subtle mistakes in encoding/decoding, e.g., getting confused about whether an offset is in bytes (for encoded (UTF-8) data) or in characters (for decoded (Unicode) data). And if your test data happens to be all ASCII, you won’t find such mistakes (you won’t see any values greater than 127).
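A minimal Python 3 sketch of the offset confusion, using the “İzmir” string from earlier in the thread. The helper `char_to_byte_offset` is a hypothetical name introduced only for this illustration.

```python
text = "İzmir"                  # decoded: a sequence of code points
data = text.encode("utf-8")     # encoded: a sequence of bytes

char_offset = text.index("z")   # offset counted in characters
byte_offset = data.index(b"z")  # offset counted in bytes
print(len(text), len(data), char_offset, byte_offset)


def char_to_byte_offset(s, char_index, encoding="utf-8"):
    """Convert a character offset into the matching byte offset (hypothetical helper)."""
    return len(s[:char_index].encode(encoding))


# With all-ASCII test data the two offsets coincide, so the bug stays hidden.
ascii_text = "Izmir"
ascii_data = ascii_text.encode("utf-8")
print(ascii_text.index("z"), ascii_data.index(b"z"))
```

“İ” takes two bytes in UTF-8, so the “z” sits at character offset 1 but byte offset 2; in the ASCII variant both offsets are 1.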

The Python documentation is a pretty good introduction, if you substitute Prolog’s list-of-codes for Python’s strings (and you can also use SWI-Prolog’s strings the same way as Python strings).

EricGT · February 24, 2020, 7:02pm

Are you talking about converting UTF-8 to UTF-16 or UTF-32?
UTF-8 → UTF-16
UTF-8 → UTF-32

I am talking about converting UTF-8, or UTF-16 to Unicode Code Points.

UTF-8 → Unicode Code Points
UTF-16 → Unicode Code Points
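In Python 3 terms, both of those conversions are a decode step followed by reading off the code points. This is only a sketch: `code_points` is a hypothetical helper, and the sample bytes encode the two-character string “İ😀” chosen for illustration.

```python
def code_points(data, encoding):
    """Decode bytes in the given encoding and return the Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


# The same two characters, "İ" (U+0130) and "😀" (U+1F600), in two encodings.
utf8_bytes = bytes.fromhex("c4b0f09f9880")    # 2 bytes + 4 bytes
utf16_bytes = bytes.fromhex("30013dd800de")   # UTF-16 little-endian, no BOM

from_utf8 = code_points(utf8_bytes, "utf-8")
from_utf16 = code_points(utf16_bytes, "utf-16-le")

print([hex(cp) for cp in from_utf8])
print([hex(cp) for cp in from_utf16])
```

Both calls give the same list of code points, even though the byte sequences differ and the second character needs a surrogate pair in UTF-16.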
