Supporting Unicode command line arguments in Windows

In a project I launch SWI-Prolog from another program using command line parameters -x <file-in-install-folder> and -f <file-in-temp-folder>.

Currently I have no influence on the installation and temporary paths on the user’s side, so they can contain any Unicode characters that are valid in paths. Unfortunately, SWI-Prolog does not find some paths that contain Unicode characters.

I checked the SWI-Prolog source code. Currently, the entry point is main(int argc, char **argv), which on Windows means that the command line arguments arrive encoded in the code page defined by the system settings. For example, by default this is Latin-1 in German/English Windows installations and Shift-JIS in Japanese ones. I don’t know whether I have understood the source code correctly, but it seems that the command line arguments (on Windows) are generally interpreted as Latin-1. If that is true, paths with non-ASCII characters are only processed correctly when the system code page is Latin-1 and the paths contain only characters that exist in the Latin-1 code page.

One easy solution would be to use the non-standard, Windows-specific wmain(int argc, wchar_t **argv) in the Windows build. But then it is not clear how to process the arguments, because PL_initialise also expects char ** as its argument type. So perhaps a prior conversion to UTF-8 is needed, or a PL_initialise(int argc, wchar_t **argv) variant.
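To make the "convert to UTF-8 first" idea concrete, here is a minimal sketch of the conversion step in portable C. The function name utf16_to_utf8 is hypothetical and not part of SWI-Prolog. On Windows one would more likely call the Win32 function WideCharToMultiByte with CP_UTF8; the hand-written loop is used here only so that the sketch compiles on any platform. It assumes the input is UTF-16 code units, which is what wmain() delivers on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of SWI-Prolog: convert a NUL-terminated
 * UTF-16 string into a newly malloc()ed, NUL-terminated UTF-8 string.
 * Returns NULL if allocation fails. Unpaired surrogates become U+FFFD.
 * The caller must free() the result. */
char *utf16_to_utf8(const uint16_t *in)
{
    assert(in != NULL);

    size_t len = 0;
    while (in[len] != 0)
        len++;

    /* A single code unit needs at most 3 bytes; a surrogate pair
     * (2 units) needs 4, so 3 bytes per unit is always enough. */
    char *out = malloc(len * 3 + 1);
    if (out == NULL)
        return NULL;

    unsigned char *p = (unsigned char *)out;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {        /* high surrogate */
            uint16_t lo = in[i + 1];               /* in[len] is the NUL */
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i++;                               /* consumed the pair */
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) { /* stray low surrogate */
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            *p++ = (unsigned char)cp;
        } else if (cp < 0x800) {
            *p++ = (unsigned char)(0xC0 | (cp >> 6));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *p++ = (unsigned char)(0xE0 | (cp >> 12));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *p++ = (unsigned char)(0xF0 | (cp >> 18));
            *p++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

A wmain() wrapper would apply this to every element of its wchar_t **argv and hand the resulting char ** array to PL_initialise, provided that PL_initialise then treats its arguments as UTF-8.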

So are there any plans to support Unicode command line arguments in Windows?

I’m not a Linux/Unix expert, so I don’t know if that problem also applies there.

Considering the way things work wrt OS strings on Windows, using wmain() and translating to UTF-8 is the most obvious way to handle this. Is this as simple as using wmain() rather than main()? I’ll have a look.

In Unix systems all OS calls that have string arguments use char*. The encoding may differ depending on the locale setup, but most modern systems are set up to use a UTF-8 based locale.
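For readers who want to check which encoding their Unix locale actually selects, the following is a small sketch using the POSIX calls setlocale() and nl_langinfo(CODESET). The helper names current_codeset and locale_is_utf8 are made up for this illustration, and the code is POSIX-only (it will not build with the Windows C runtime).

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: name of the character encoding of the locale
 * selected by the environment (LANG, LC_ALL, LC_CTYPE). */
const char *current_codeset(void)
{
    setlocale(LC_ALL, "");              /* adopt the user's locale */
    const char *cs = nl_langinfo(CODESET);
    assert(cs != NULL);
    return cs;
}

/* Hypothetical helper: 1 if the locale encoding is reported as UTF-8,
 * 0 otherwise. The exact spelling of the name depends on the C library,
 * so two common spellings are accepted. */
int locale_is_utf8(void)
{
    const char *cs = current_codeset();
    return strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0;
}
```

When locale_is_utf8() returns 1, the char* strings passed to main() and to file system calls can be assumed to be UTF-8; otherwise they are in whatever encoding current_codeset() reports.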

Ended up adding PL_winitialise() to avoid getting too many dependencies into pl-main.c. Seems to work now (tested only minimally using Wine under Linux). Should be in tomorrow’s daily build.

I did some tests and unfortunately it does not work. As an example, I tried calling Prolog with the arguments -l täst.p.

I analysed the source code and I think it cannot work.

All command line arguments given to PL_initialise(int argc, char **argv) are saved unchanged in some data structures. When the command line options are fetched using $cmd_option_val (in pl-prims.c) an atom is created from that data without any conversion. So the atom contains the arguments still in UTF-8 (before the fix it was the current code page of Windows).

Then the path argument atom (in the example above, the script file argument) is converted using prolog_to_os_filename (pl-files.c), which extracts the atom text using PL_get_wchars. According to the internally called function get_atom_ptr_text, there are two possible encodings: ENC_ISO_LATIN_1 for “normal” atoms and ENC_WCHAR for Unicode atoms. In this example it uses ENC_ISO_LATIN_1, so the conversion to wchar_t goes wrong: the atom contains the two bytes C3 A4 for the character ä (which is correct UTF-8), but the converted wchar_t string contains the two wrong wide characters 00C3 00A4 instead of the single character 00E4. So with the fix, no path with non-ASCII characters can be used anymore.
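The mismatch can be reproduced in a few lines of portable C. This is only an illustration of the two decodings, not SWI-Prolog code: the function names decode_latin1 and decode_utf8 are invented for the example, and uint32_t code points are used instead of wchar_t so that the result does not depend on the platform's wchar_t width. The UTF-8 decoder is deliberately minimal and assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: decode bytes as ISO Latin-1, where every
 * byte is exactly one code point. 'out' must have room for one entry
 * per input byte. Returns the number of code points written. */
size_t decode_latin1(const char *in, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (; *in != '\0'; in++)
        out[n++] = (unsigned char)*in;
    return n;
}

/* Hypothetical illustration: decode bytes as UTF-8. Assumes well-formed
 * input; a truncated sequence is simply cut short. 'out' must have room
 * for one entry per input byte. Returns the number of code points. */
size_t decode_utf8(const char *in, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s != 0) {
        uint32_t cp;
        int extra;                      /* continuation bytes expected */

        if (*s < 0x80)      { cp = *s;        extra = 0; }
        else if (*s < 0xE0) { cp = *s & 0x1F; extra = 1; }
        else if (*s < 0xF0) { cp = *s & 0x0F; extra = 2; }
        else                { cp = *s & 0x07; extra = 3; }
        s++;

        while (extra-- > 0 && (*s & 0xC0) == 0x80)
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

        out[n++] = cp;
    }
    return n;
}
```

Fed the UTF-8 bytes of "täst" (74 C3 A4 73 74), decode_latin1 yields five code points with 00C3 and 00A4 in the middle, while decode_utf8 yields four code points with 00E4 in second position. The first result is the wrong wide string described above.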

As I assumed in the initial message, the command line arguments are always processed internally as Latin-1-encoded on Windows, and perhaps also on Linux.

So to fix this, $cmd_option_val should create Unicode atoms on Windows instead.

Thanks for the details and the analysis. I had only verified that current_prolog_flag/2 with argv returned the right thing. The command line options parsed in C are handled differently, so I applied the encoding conversion there as well. As suspected, this also prevented the Linux version from running swipl -l file.pl if the file name contains non-ASCII characters.

Next try tomorrow (or you should compile from source).

I did some more tests, also with other Unicode characters, and I can confirm that the new nightly build is working. Perhaps the documentation should state that on Windows PL_initialise now expects UTF-8 encoded arguments.