Unicode representation issues with Windows/Wine

jamesnvc · June 25, 2025, 7:50pm

I’m currently trying to fix some issues with the LSP formatting running on Windows. I have a test that tries to round-trip a file, reading it in then writing the parsed form back out. The tests pass on macOS and Linux, but on Windows (tested on Wine, but apparently also on real Windows), the tests fail.

Right now, the failure is caused by the presence of a multi-byte character (specifically the emoji “👍”). It seems that it reads in as ðŸ‘\u008D but writes out as ðŸ‘\\x8D\\. I’ve tried varying the representation_errors property of both the reading and the writing stream, setting the encoding Prolog flag (which seems to just output the invalid character for both bytes, and the test still fails), and changing the character_escapes_unicode Prolog flag, all to no avail.

Removing the character does make the test pass. Any idea where I should be looking to make this work on Windows?
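A minimal diagnostic sketch for this kind of problem: before changing any settings, it can help to print which encoding SWI-Prolog is actually using. The predicate name show_encodings/0 is hypothetical; it relies only on current_prolog_flag/2 and stream_property/2.

```prolog
% show_encodings/0 is a hypothetical helper name, not part of SWI-Prolog.
% It prints the default encoding flag and the encoding of the standard streams.
show_encodings :-
    current_prolog_flag(encoding, Default),
    format("encoding flag: ~w~n", [Default]),
    forall(member(S, [user_input, user_output, user_error]),
           (   stream_property(S, encoding(E))
           ->  format("~w: ~w~n", [S, E])
           ;   format("~w: (no encoding property)~n", [S])
           )).
```

Running this on Linux, macOS and Windows/Wine side by side should show whether the platforms disagree about the default.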

jan · June 25, 2025, 8:11pm

Where are you writing to? SWI-Prolog is supposed to be capable of reading and writing the emoji (or, more generally, Unicode code points > 0xffff, which require UTF-16 surrogate pairs on Windows) correctly from/to streams using UTF-8 or UTF-16 encoding. Notably, the console and xpce (GUI) are problematic.
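For reference, a small sketch of how a code point above 0xFFFF maps to a UTF-16 surrogate pair. The predicate name surrogate_pair/3 is hypothetical; the arithmetic is the standard UTF-16 encoding rule.

```prolog
% surrogate_pair(+Code, -High, -Low)
% Hypothetical helper: split a code point above 0xFFFF into its
% UTF-16 high and low surrogates.
surrogate_pair(Code, High, Low) :-
    integer(Code),
    Code > 0xFFFF,
    Code =< 0x10FFFF,
    V is Code - 0x10000,            % 20-bit offset into the supplementary planes
    High is 0xD800 + (V >> 10),     % top 10 bits
    Low  is 0xDC00 + (V /\ 0x3FF).  % bottom 10 bits
```

For U+1F44D, the query `surrogate_pair(0x1F44D, H, L)` gives H = 0xD83D and L = 0xDC4D (printed in decimal as 55357 and 56397).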

jamesnvc · June 25, 2025, 9:23pm

This is for a plunit test, where I write to current_output inside with_output_to/2, capturing the output to a string, which is then compared to the result of read_file_to_string/3.

j4n_bur53 · June 26, 2025, 6:04am

Close to UTF-8, which gives 4 bytes for the range 0x10000 - 0x10FFFF:

0xF0 0x9F 0x91 0x8D

https://www.compart.com/en/unicode/U+1F44D

Only Ÿ‘ doesn’t match the two inner bytes, so still nonsense.
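The four bytes quoted above can be reproduced with a short sketch. The predicate name utf8_bytes4/2 is hypothetical and only covers the 4-byte range 0x10000 - 0x10FFFF; it applies the standard UTF-8 bit layout.

```prolog
% utf8_bytes4(+Code, -Bytes)
% Hypothetical helper: UTF-8 encode a code point in 0x10000..0x10FFFF.
% Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
utf8_bytes4(Code, [B1, B2, B3, B4]) :-
    integer(Code),
    Code >= 0x10000,
    Code =< 0x10FFFF,
    B1 is 0xF0 \/ (Code >> 18),
    B2 is 0x80 \/ ((Code >> 12) /\ 0x3F),
    B3 is 0x80 \/ ((Code >> 6) /\ 0x3F),
    B4 is 0x80 \/ (Code /\ 0x3F).
```

The query `utf8_bytes4(0x1F44D, Bs)` gives the bytes 0xF0 0x9F 0x91 0x8D (printed in decimal as [240, 159, 145, 141]). Reading those four bytes as Windows-1252 instead of UTF-8 would give ð, Ÿ, ‘ and an unassigned 0x8D, which is consistent with the garbled text reported in the first post.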

jan · June 26, 2025, 8:02am

Can you give the complete code? The write to string seems fine. This is 9.3.25 under wine, where the console is known to fail to handle surrogate pairs, so the toplevel write of S shows the two surrogates. The string S contains the correct character though.

?- with_output_to(string(S), write('\U0001f603')), string_codes(S, Codes).
S = "��",
Codes = [128515].

jamesnvc · June 26, 2025, 10:22am

Sure, it’s this test in the LSP server: lsp_server/test/formatter.plt at b060235d66719133498d8dc10640d7c30691d343 · jamesnvc/lsp_server · GitHub

I’m running it like this: wine swipl.exe -g run_tests -t halt formatter.plt

jan · June 26, 2025, 1:01pm

Hmmm.

jan@janw (test; master) 11_> ~/src/swipl-devel/build.win64/src/swipl.exe -g run_tests -t halt formatter.plt
[2/2] formatting:Formatting example file ......... passed (0.038 sec)
% All 2 tests passed in 0.127 seconds (0.080 cpu)
jan@janw (test; master) 12_> ~/src/swipl/build.win64/src/swipl.exe -g run_tests -t halt formatter.plt
[2/2] formatting:Formatting example file ......... passed (0.040 sec)
% All 2 tests passed in 0.129 seconds (0.080 cpu)

I see no problem, both with 9.3.25 and 9.2.9. What am I missing?

P.s. wine-10.4 (Staging) on Fedora 42

jamesnvc · June 26, 2025, 1:41pm

Oh, sorry, I’d pushed a reversion of the change that made things fail; can you check out 9aeb63e0466ce623c92069145583615797e06882 before running the test? Sorry about that.

jan · June 26, 2025, 8:23pm

Ok. There were two problems. One is in your code. Your files are UTF-8, but you did not tell open/3 that. On most modern non-Windows systems UTF-8 is the default, so that will work fine. On Windows the situation is less clear-cut, and under Wine even more so. So, you should either use open/4 with the option encoding(utf8) or set the Prolog flag encoding to utf8. This flag acts as the default for open/3 calls. The latter is probably the best.
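A minimal sketch of the two options described above. The helper names write_utf8/2, read_utf8/2 and roundtrip_ok/1 are hypothetical and are not taken from the lsp_server code; the sketch assumes a writable temporary directory.

```prolog
:- use_module(library(readutil)).   % read_file_to_string/3

% Option 1: make UTF-8 the default encoding for open/3 calls.
:- set_prolog_flag(encoding, utf8).

% Option 2: state the encoding explicitly for each stream.
write_utf8(File, Text) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        write(Out, Text),
        close(Out)).

read_utf8(File, String) :-
    read_file_to_string(File, String, [encoding(utf8)]).

% Write Text to a temporary file and check that it reads back unchanged.
% Code lists are compared rather than the strings themselves.
roundtrip_ok(Text) :-
    tmp_file(utf8_demo, File),
    write_utf8(File, Text),
    read_utf8(File, String),
    string_codes(String, Codes),
    string_codes(Text, Codes).
```

With either option in place, a query such as `roundtrip_ok('\U0001F44D')` is expected to succeed regardless of the platform default.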

Second, there was an issue in the Windows versions that could cause two identical strings containing UTF-16 surrogates to compare non-equal. Pushed a fix for that.

Thanks for your patience

jamesnvc · June 26, 2025, 9:57pm

Oh great, thank you! I’d tried setting the encoding to UTF-8 in some of my experiments, but was still seeing failures (presumably due to that surrogate comparison issue), so I never committed it. Thanks again Jan, you’re the best!
