Unicode representation issues with Windows/Wine

I’m currently trying to fix some issues with the LSP formatting running on Windows. I have a test that tries to round-trip a file, reading it in then writing the parsed form back out. The tests pass on macOS and Linux, but on Windows (tested on Wine, but apparently also on real Windows), the tests fail.

Right now, the test fails because of a multi-byte character (specifically the 👍 emoji, “:+1:”). It seems that it reads in as ðŸ‘\u008D but writes out as ðŸ‘\\x8D\\. I’ve tried varying the representation_errors property of both the reading and the writing stream, setting the encoding Prolog flag (which seems to just output the invalid character for both bytes, and the test still fails), and changing the character_escapes_unicode Prolog flag, all to no avail.

Removing the character does make the test pass. Any idea where I should be looking to make this work on Windows?

Where are you writing to? SWI-Prolog is supposed to be capable of reading and writing the emoji (or, more generally, any Unicode code point above 0xFFFF, which requires a UTF-16 surrogate pair on Windows) correctly from/to streams using UTF-8 or UTF-16 encoding. The known problem areas are the console and xpce (GUI).
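To make the surrogate-pair remark concrete, here is a minimal sketch of the standard UTF-16 arithmetic for a code point above 0xFFFF. The predicate name surrogate_pair/3 is made up for illustration and is not part of SWI-Prolog.

```prolog
% surrogate_pair(+CodePoint, -High, -Low)
% Hypothetical helper: split a code point above 0xFFFF into the two
% UTF-16 code units (high and low surrogate) that encode it.
surrogate_pair(CodePoint, High, Low) :-
    integer(CodePoint),
    CodePoint > 0xFFFF,
    CodePoint =< 0x10FFFF,
    Offset is CodePoint - 0x10000,
    High is 0xD800 + (Offset >> 10),      % top 10 bits
    Low  is 0xDC00 + (Offset /\ 0x3FF).   % bottom 10 bits

% ?- surrogate_pair(0x1F44D, H, L).
% H = 55357,   % 0xD83D
% L = 56397.   % 0xDC4D
```

Code points up to 0xFFFF fit in a single 16-bit unit, so the predicate simply fails for them.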

This is for a plunit test, where I write to current_output inside with_output_to/2, capturing to a string, which is then compared to the result of read_file_to_string/3.

That is close to the UTF-8 encoding, which uses 4 bytes for code points in the range 0x10000 - 0x10FFFF:

0xF0 0x9F 0x91 0x8D

https://www.compart.com/en/unicode/U+1F44D

Only Ÿ‘ doesn’t match the two inner bytes, so still nonsense.
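The byte sequence can be checked from Prolog itself. This is a small sketch using utf8_codes//1 from library(utf8); the wrapper name utf8_bytes/2 is only an illustrative helper.

```prolog
:- use_module(library(utf8)).

% utf8_bytes(+CodePoint, -Bytes)
% Illustrative helper: the UTF-8 byte sequence for a single code point,
% obtained by running utf8_codes//1 in the encoding direction.
utf8_bytes(CodePoint, Bytes) :-
    phrase(utf8_codes([CodePoint]), Bytes).

% ?- utf8_bytes(0x1F44D, Bytes).
% Bytes = [240, 159, 145, 141].   % 0xF0 0x9F 0x91 0x8D
```

For what it is worth, Windows-1252 (unlike ISO Latin-1) maps 0x9F to Ÿ and 0x91 to ‘ and leaves 0x8D undefined, so the text seen on reading would be consistent with the UTF-8 bytes being decoded under a Windows code page. That is an observation about the code page tables, not something verified against this particular setup.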

Can you give the complete code? The write to string seems fine. This is 9.3.25 under wine, where the console is known to fail to handle surrogate pairs, so the toplevel write of S shows the two surrogates. The string S contains the correct character though.

?- with_output_to(string(S), write('\U0001f603')), string_codes(S, Codes).
S = "��",
Codes = [128515].

Sure, it’s this test in the LSP server: lsp_server/test/formatter.plt at b060235d66719133498d8dc10640d7c30691d343 · jamesnvc/lsp_server · GitHub

Running like wine swipl.exe -g run_tests -t halt formatter.plt

Hmmm.

jan@janw (test; master) 11_> ~/src/swipl-devel/build.win64/src/swipl.exe -g run_tests -t halt formatter.plt
[2/2] formatting:Formatting example file ......... passed (0.038 sec)
% All 2 tests passed in 0.127 seconds (0.080 cpu)
jan@janw (test; master) 12_> ~/src/swipl/build.win64/src/swipl.exe -g run_tests -t halt formatter.plt
[2/2] formatting:Formatting example file ......... passed (0.040 sec)
% All 2 tests passed in 0.129 seconds (0.080 cpu)

I see no problem, both with 9.3.25 and 9.2.9. What am I missing?

P.s. wine-10.4 (Staging) on Fedora 42

Oh, sorry, I’d pushed a reversion of the change that made things fail; can you check out 9aeb63e0466ce623c92069145583615797e06882 before running the test? Sorry about that.

Ok. There were two problems. One is in your code. Your files are UTF-8, but you did not tell that to open/3. On most modern non-Windows systems UTF-8 is the default, so that works fine. On Windows the situation is less clear-cut, and under Wine even more so. So, you should either use open/4 with the option encoding(utf8) or set the Prolog flag encoding to utf8. This flag acts as the default for open/3 calls. The latter is probably the best.
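A minimal sketch of both options follows. The predicate names read_utf8_file/2 and read_utf8_file_string/2 are made up for illustration; the actual LSP server code will look different.

```prolog
:- use_module(library(readutil)).

% Option 1: state the encoding explicitly with open/4.
% read_utf8_file/2 is a hypothetical helper name.
read_utf8_file(File, String) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8)]),
        read_string(In, _Length, String),
        close(In)).

% The same option can be passed through read_file_to_string/3,
% which hands its options on to open/4.
read_utf8_file_string(File, String) :-
    read_file_to_string(File, String, [encoding(utf8)]).

% Option 2: make UTF-8 the default for streams opened without an
% explicit encoding. Uncomment to apply it when this file is loaded.
% :- set_prolog_flag(encoding, utf8).
```

With the explicit option the result no longer depends on the platform default, so the same code should behave identically on Linux, macOS, Windows and Wine. Output streams opened with open/4 accept the same encoding(utf8) option.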

Second, there was an issue in the Windows versions that could cause two identical strings containing UTF-16 surrogates to compare non-equal. Pushed a fix for that.

Thanks for your patience


Oh great, thank you! I’d tried setting the encoding to utf-8 in some of my experimenting, but was still seeing failures (presumably due to that surrogate comparison issue), so I never committed it. Thanks again Jan, you’re the best!
