I’m currently trying to fix some issues with the LSP formatting running on Windows. I have a test that tries to round-trip a file, reading it in then writing the parsed form back out. The tests pass on macOS and Linux, but on Windows (tested on Wine, but apparently also on real Windows), the tests fail.
Right now, the test fails because of a multi-byte character (specifically the emoji “👍”). It seems that it reads in as ðŸ‘\u008D but writes out as ðŸ‘\\x8D\\. I’ve tried varying the representation_errors property of both the reading and writing streams, setting the encoding Prolog flag (which seems to just output the invalid character for both bytes, and the test still fails), and changing the character_escapes_unicode Prolog flag, to no avail.
Removing the character does make the test pass. Any idea where I should be looking to make this work on Windows?
Where are you writing to? SWI-Prolog is supposed to be capable of reading and writing the emoji (or, more generally, Unicode code points > 0xffff, which require UTF-16 surrogate pairs on Windows) correctly from/to streams using UTF-8 or UTF-16 encoding. Notably, the console and xpce (GUI) are problematic.
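For readers unfamiliar with surrogate pairs, here is a minimal sketch of the standard UTF-16 arithmetic for a code point above 0xFFFF. The predicate name surrogate_pair/3 is hypothetical and only for illustration; it is not part of SWI-Prolog.

```prolog
% surrogate_pair(+CodePoint, -High, -Low)
% Hypothetical helper: splits a code point above 0xFFFF into the
% two 16-bit UTF-16 surrogates that represent it.
surrogate_pair(CP, High, Low) :-
    CP > 0xFFFF,
    Offset is CP - 0x10000,             % 20 bits remain
    High is 0xD800 + (Offset >> 10),    % top 10 bits
    Low  is 0xDC00 + (Offset /\ 0x3FF). % bottom 10 bits
```

For example, the query surrogate_pair(0x1F44D, H, L) should bind H to 0xD83D and L to 0xDC4D, which are the two units a console without surrogate support would show separately.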
This is for a plunit test, where I write to current_output inside with_output_to/2, capturing to a string, which is then compared to the result of read_file_to_string/3.
Can you give the complete code? The write to string seems fine. This is 9.3.25 under wine, where the console is known to fail to handle surrogate pairs, so the toplevel write of S shows the two surrogates. The string S contains the correct character though.
Oh, sorry, I’d pushed a reversion of the change that made things fail; can you check out 9aeb63e0466ce623c92069145583615797e06882 before running the test? Sorry about that.
Ok. There were two problems. One is in your code. Your files are UTF-8, but you did not tell that to open/3. On most modern non-Windows systems UTF-8 is the default, so that will work fine. On Windows the default is less clear-cut, and under Wine even less so. So, you should either use open/4 with the option encoding(utf8) or set the Prolog flag encoding to utf8. This flag acts as the default for open/3 calls. The latter is probably the best.
Second, there was an issue in the Windows versions that could cause two identical strings containing UTF-16 surrogates to compare non-equal. Pushed a fix for that.
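As a rough illustration of the first point, the two ways of selecting UTF-8 could look like the sketch below. The predicate names read_utf8/2 and write_utf8/2 are hypothetical helpers for this example, not taken from the code under discussion.

```prolog
:- use_module(library(readutil)).

% Option 1: name the encoding explicitly on each stream.
% read_file_to_string/3 passes the encoding option on to open/4.
read_utf8(File, String) :-
    read_file_to_string(File, String, [encoding(utf8)]).

write_utf8(File, String) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        write(Out, String),
        close(Out)).

% Option 2: make UTF-8 the default for later open/3 calls.
% Uncomment to apply it when this file is loaded.
% :- set_prolog_flag(encoding, utf8).
```

With the helpers above, a round trip such as tmp_file(demo, F), write_utf8(F, "ok \x1F44D\"), read_utf8(F, S) should give back the same string on a version that includes the comparison fix.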
Oh great, thank you! I’d tried setting the encoding to utf-8 in some of my experimenting, but was still seeing failures - presumably due to that surrogate comparison issue - so I never committed it. Thanks again Jan, you’re the best!