Default locale and encoding settings on macOS

Follow-up to Swiplserver: load existing prolog file into server - #3 by jan - splitting it into its own thread, because my reply is long and not really related to swiplserver.

Inside Terminal.app, at least some locale variables should be set. There is a setting to disable this (Terminal > Preferences…, Profiles > Advanced > Set locale environment variables on startup), but I think it’s enabled by default. So the average terminal should have locale information, unless you go out of your way to disable it.

With the setting enabled, you should get something like LANG=de_DE.UTF-8. The language part comes from the system language/region settings, and the encoding part comes from the Terminal.app settings. You can choose a different encoding right above where you can disable the locale variables, but I don’t know why you’d use anything other than the default UTF-8.

Interestingly, if I change my system language (no matter which) and then restart Terminal.app, it sets a different variable - LANG becomes unset, and instead it just sets LC_CTYPE=UTF-8. If I switch back to German, it sets LANG=de_DE.UTF-8 again. I’m not sure why it does this, but my guess is that it’s because I changed the language in a running system, and that it would set LANG=en_US.UTF-8 once I relog/restart.

In any case, the variables are set by Terminal.app itself, not by any shell profile. You can see this if you ask Terminal.app to run printenv without a shell (Shell > New Command…) - LANG/LC_CTYPE is still set. But you are correct that they are only set inside Terminal.app, not in a regular macOS GUI application. If you run printenv from SWI-Prolog.app, no locale variables are set:

?- use_module(library(process)).
?- process_create(path(printenv), [], [stdout(pipe(_Stdout))]), read_stream_to_codes(_Stdout, _Codes), string_codes(String, _Codes).

I don’t know if there’s any way to make macOS set locale variables inside GUI applications. My guess is that there isn’t, and that they expect you to use the native macOS locale and encoding handling APIs instead, like CFString and CFLocale (and their Objective-C/Swift counterparts).

However, it seems that macOS tries to avoid the concept of a “default encoding”. The closest thing is CFStringGetSystemEncoding, but the documentation warns that it’s only meant for working with legacy Mac OS APIs that don’t use Unicode, and that your application code should use a different encoding. And indeed, on my system, this function returns kCFStringEncodingMacRoman, which was the system encoding of German Classic Mac OS, but is basically never used on modern macOS.

So I don’t think it makes much sense to try to determine a “system text encoding” on macOS, at least not outside of a terminal where you have Unix-style locale info. I wonder if SWI on macOS should simply default to utf8, as that is the de facto standard (on macOS and elsewhere). At the moment, it defaults to text and behaves like Latin-1, which is basically never the right choice on macOS.

Thanks for the insights. I think my comments were a bit pessimistic and referred to older versions of macOS. Good to hear that this has improved a bit. I just removed LANG from my .zshrc and can confirm the settings. It also seems that UTF-8 handling works out of the box with just LC_CTYPE=UTF-8. Prolog indeed sets the encoding to text in that case, which leaves the translation to the C99 wide character primitives. That is because it only switches to explicit utf8 if LANG ends in .UTF-8 (note the dot).

I guess to be a real Mac app we should probably use CFString and CFLocale. I don’t really fancy having a dual implementation or inventing some abstraction over the macOS API, C99 wide character handling and all of Prolog’s internals for Unicode handling :frowning: macOS does have the complete C locale API, so let us hope it works as advertised :slight_smile:

Something more practical … Ideally we should probably mimic Terminal.app to initialize LANG and/or LC_CTYPE. Do you have an idea how they derive LANG=de_DE.UTF-8? If not, I propose two things:

  1. Make Prolog set LC_CTYPE=UTF-8 if no LANG or LC_CTYPE are defined. That should tell processes Prolog creates that Prolog assumes UTF-8.
  2. Extend the automatic setting for the encoding flag to initialize UTF-8 handling if LC_CTYPE equals UTF-8.
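
The two proposals above could be sketched roughly as follows in portable C. This is only an illustration under stated assumptions, not SWI-Prolog's actual initialization code: the names `ends_with` and `init_default_encoding` and the returned strings are hypothetical, and the sketch deliberately ignores LC_ALL and the precedence between LANG and LC_CTYPE.

```c
#define _POSIX_C_SOURCE 200809L /* for setenv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if `s` ends with `suffix`, 0 otherwise. */
int ends_with(const char *s, const char *suffix) {
    assert(s != NULL && suffix != NULL);
    size_t slen = strlen(s), suflen = strlen(suffix);
    return slen >= suflen && strcmp(s + slen - suflen, suffix) == 0;
}

/* Hypothetical helper: choose the initial value of the encoding flag.
 * Returns "utf8" or "text". */
const char *init_default_encoding(void) {
    const char *lang = getenv("LANG");
    const char *ctype = getenv("LC_CTYPE");
    int have_lang = lang != NULL && lang[0] != '\0';
    int have_ctype = ctype != NULL && ctype[0] != '\0';

    if (!have_lang && !have_ctype) {
        /* Proposal 1: nothing is set, so tell child processes that
         * UTF-8 is assumed. */
        if (setenv("LC_CTYPE", "UTF-8", 1) != 0)
            return "text";
        return "utf8";
    }
    /* Existing rule: LANG ends in ".UTF-8" (note the dot). */
    if (have_lang && ends_with(lang, ".UTF-8"))
        return "utf8";
    /* Proposal 2: also accept the bare LC_CTYPE=UTF-8 from Terminal.app. */
    if (have_ctype && strcmp(ctype, "UTF-8") == 0)
        return "utf8";
    return "text";
}
```

With an empty environment the helper sets LC_CTYPE=UTF-8 and returns "utf8"; with LANG=de_DE and no LC_CTYPE it returns "text".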

The encoding part comes directly from the Terminal.app preferences. Outside of a terminal, there is no user encoding preference, so the application has to choose its own encoding.

The language/region part comes from the user’s global settings. I think you can get an identical string using CFLocaleCopyCurrent and then CFLocaleGetIdentifier. Does SWI actually need this though? I thought SWI used the locale APIs only for encoding handling (if not set to UTF-8).

That seems reasonable. There is also the more conservative option of allowing only ASCII if no locale info is set - that way you get a hard error instead of silently mis-encoded text. This is what Python did for a long time, but they are now also moving to assuming UTF-8.

That would definitely be good. From what I can tell, LC_CTYPE=UTF-8 is not quite standard (it should be LC_CTYPE=C.UTF-8), but we have to live with what Terminal.app gives us.

No. It has more locale support. Think of format_time/3, format/2,3 using the : modifier for numerical arguments, locale_sort/2 and the message system listens to LC_MESSAGES (allowing for plugin message files that produce system and application messages in other languages).
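
To illustrate why categories other than LC_CTYPE matter, here is a minimal sketch of how the C runtime exposes the numeric conventions of a locale. It shows plain C library behaviour only; the function name `numeric_conventions` is hypothetical, and nothing here is taken from how format/2 is implemented.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Copy the decimal point and thousands separator of locale `name`
 * into the given buffers. Returns 1 if the C library knows the locale,
 * 0 otherwise. The previous LC_NUMERIC setting is restored. */
int numeric_conventions(const char *name, char *decimal, size_t dsize,
                        char *thousands, size_t tsize) {
    assert(name != NULL && decimal != NULL && thousands != NULL);
    char saved[256];
    const char *current = setlocale(LC_NUMERIC, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    if (setlocale(LC_NUMERIC, name) == NULL)
        return 0; /* unknown locale, nothing was changed */

    const struct lconv *lc = localeconv();
    snprintf(decimal, dsize, "%s", lc->decimal_point);
    snprintf(thousands, tsize, "%s", lc->thousands_sep);

    setlocale(LC_NUMERIC, saved);
    return 1;
}
```

In the "C" locale the decimal point is "." and the thousands separator is empty; which other locale names are accepted depends on what the system has installed.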

I see enough complaints resulting from not handling UTF-8 to make it the default. Note that if people use a different locale they will get warnings and errors about illegal UTF-8 sequences, which should give them a hint.

That seems to be Objective-C. I found CFLocale.c. It seems rather complicated :frowning:

Core Foundation is a pure C API (the Objective-C version is Foundation). Yeah, the API is a bit complex, but it’s tolerable. I wouldn’t recommend looking at the implementation code though, it’s much worse than the API :slight_smile:

Here’s a “short” example to get the current locale identifier as a C string:

#include <CoreFoundation/CoreFoundation.h>
#include <stdio.h>
#include <stdlib.h>

static char *cStringFromCFString(CFStringRef string, CFStringEncoding encoding) {
	size_t bufferSize = CFStringGetMaximumSizeForEncoding(CFStringGetLength(string), encoding) + 1;
	char *buffer = malloc(bufferSize);
	if (buffer == NULL) {
		return NULL;
	}
	if (!CFStringGetCString(string, buffer, bufferSize, encoding)) {
		free(buffer);
		return NULL;
	}
	return buffer;
}

int main(int argc, char **argv) {
	CFLocaleRef currentLocale = CFLocaleCopyCurrent();
	CFStringRef identifier = CFLocaleGetIdentifier(currentLocale);
	char *buffer = cStringFromCFString(identifier, kCFStringEncodingUTF8);
	if (buffer != NULL) {
		printf("Current locale identifier: %s\n", buffer);
		free(buffer);
	}
	CFRelease(currentLocale);
}

$ clang -o get_locale_identifier get_locale_identifier.c -framework CoreFoundation
$ ./get_locale_identifier
Current locale identifier: de_DE

Thanks a lot. Good to see we’ve got someone with knowledge of the macOS internals in our midst. I’ll try not to push my luck :slight_smile: You may want to have an opinion on one other issue though: what should be the initial working directory of the app? Now it is /, which seems bad. $HOME? Or does Apple define some app-specific data directory that we can use?

Yeah, I noticed that. Where does the / CWD come from - does swipl-win set it manually, or is that what macOS uses as the default directory for GUI apps? (I actually don’t know what the default CWD is, heh.) The home directory would be fine, since that’s the default directory in the shell as well. FWIW, Python’s GUI shell (IDLE) defaults to ~/Documents.

Well, there is ~/Library/Application Support, where apps usually create their own subfolder to store data. It’s basically the Mac-native version of $XDG_DATA_HOME (~/.local/share). But you wouldn’t use that as the default working directory, because users normally don’t use Application Support directly - it’s just an internal storage location for app-specific data and state. The user’s files go into Documents, which is probably why Python chose that as the default directory :slight_smile:

The app does nothing (AFAIK), so I guess it comes from macOS. If Python uses ~/Documents, why shouldn’t we use the same? Does anyone know what similar applications do, like RStudio, etc.?

Pushed stuff to address this. It is not really satisfactory though. The current locale identifier turns out to be the primary language, “_”, and the region code. I had my Mac set up using English (UK) and region “Netherlands”, resulting in en_NL, which is not a valid locale :frowning: Anyway, on macOS you can now get this value using apple_current_locale_identifier/1.

The app now sets LC_CTYPE to UTF-8 if neither LANG nor LC_CTYPE is present. That seems to work fine.

The second problem is setting the app working directory to ~/Documents. This works fine, but accessing any file there results in “SWI-Prolog would like to access files in your Documents folder”. You can confirm this. Unfortunately any subsequent access to this folder results in the same prompt :frowning: There seems to be no obvious way to allow this permanently :frowning:

Ah, that’s unfortunate… and also not an easy problem to solve I think - if macOS only supports matching language/region combinations as locales, then there’s no way to faithfully represent something like English (UK)/Netherlands. You would probably have to settle for en_GB, which would give the right language, but with perhaps sometimes wrong regional formats.

Alternatively, you could try to be smart and set the different LC_... variables as appropriate, e.g. LC_MONETARY and LC_NUMERIC to nl_NL and everything else to en_GB. But implementing that properly might be more trouble than it’s worth - and apparently even Apple’s own terminal doesn’t bother and sets only LC_CTYPE in that case.

Ah, I see you’re using macOS 10.15 or newer :slight_smile: As far as I know, the system is supposed to remember which apps you’ve given access to your Documents, so the prompt should only appear once. But clearly that doesn’t always work right. I guess the home folder might be a better choice after all then?

I’d also expected this only once. That would be good. And yes, I’m running Monterey (12.1). The good news is that $HOME doesn’t suffer from this, so that is what it will be for now. Users can tweak this in their init file.

I’ve now also fixed the locale using this sequence:

  1. If neither LANG nor LC_CTYPE is given, set LC_CTYPE to UTF-8 and Prolog’s encoding to utf8. This happens deep down in the early initialization.
  2. At the Prolog level, if we are in the app and LocaleIdentifier.UTF-8 is a valid locale, set LANG to this locale and unset LC_CTYPE.
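
The second step of this sequence can be sketched in C as follows. This is a hedged illustration, not the code that was pushed: the real check runs at the Prolog level, the names `locale_is_valid`, `compose_utf8_locale` and `choose_lang` are hypothetical, and "valid" here simply means that the C library's setlocale() accepts the name.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the C library accepts `name` for LC_CTYPE, 0 otherwise.
 * The previous LC_CTYPE setting is restored afterwards. */
int locale_is_valid(const char *name) {
    assert(name != NULL);
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");
    int ok = setlocale(LC_CTYPE, name) != NULL;
    setlocale(LC_CTYPE, saved);
    return ok;
}

/* Write "<identifier>.UTF-8" into `buf`.
 * Returns 1 on success, 0 if the result does not fit. */
int compose_utf8_locale(const char *identifier, char *buf, size_t size) {
    assert(identifier != NULL && buf != NULL);
    int n = snprintf(buf, size, "%s.UTF-8", identifier);
    return n > 0 && (size_t)n < size;
}

/* Returns 1 and fills `lang` if "<identifier>.UTF-8" is usable as LANG.
 * Returns 0 if the caller should keep the LC_CTYPE=UTF-8 fallback. */
int choose_lang(const char *identifier, char *lang, size_t size) {
    return compose_utf8_locale(identifier, lang, size) &&
           locale_is_valid(lang);
}
```

When `choose_lang` succeeds, the caller would set LANG to the composed name and unset LC_CTYPE. An identifier such as en_NL is expected to fail the check on systems that do not ship an en_NL.UTF-8 locale, which leaves the fallback from step 1 in place.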

LC_CTYPE=UTF-8 is not innocent as it upsets the X11 initialization because it is not a valid locale and thus the process falls back to the C locale :frowning:

In short, it now works as one would expect if the Mac Language & Region are set to a combination that encodes a valid locale. That is at least a lot better than it used to be :slight_smile:

Some time ago I inserted the lines below for UTF-8 use. I had been annoyed by many locale-related troubles before that. I inserted the lines with my fingers crossed, as if consulting an oracle. Thanks to their great magical power, the troubles disappeared, though I still have no idea what the lines are doing. (macOS Monterey)

In ~/.zshrc

export LC_ALL=ja_JP.UTF-8
export LANG=ja_JP.UTF-8

Particularly for use of ediprolog mode.

In ~/.emacs.d/init.el

(set-language-environment "Japanese")
(prefer-coding-system 'utf-8)
(set-default-coding-systems 'utf-8)
(set-language-environment 'utf-8)
(set-selection-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-buffer-file-coding-system 'utf-8)
(setq default-buffer-file-coding-system 'utf-8)

;; for lualatex.
(setenv "LC_ALL" "ja_JP.UTF-8")
(setenv "LANG" "ja_JP.UTF-8")

I wouldn’t be surprised if the only thing really needed is export LANG=ja_JP.UTF-8, and even this is probably no longer needed if Language & Region are set properly, because that causes Terminal.app to set LANG=ja_JP.UTF-8. I’d assume that Emacs will do all these things if the locale is UTF-8 based.

LANG and the various LC_* variables initialize the POSIX/C runtime internationalization and character encoding system. AFAIK LANG sets the default for all aspects. The individual LC_* variables can be used to select different conventions for date/time, numbers, messages, etc. Apple has a completely different infrastructure for these problems.
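
The lookup order that POSIX describes for these variables can be written down as a small C sketch: LC_ALL overrides everything, then the category's own variable, then LANG, and finally an implementation-defined default (assumed to be "C" here). The function names are hypothetical, and a variable that is set but empty is treated as unset.

```c
#define _POSIX_C_SOURCE 200809L /* for setenv()/unsetenv() when trying this out */
#include <assert.h>
#include <stdlib.h>

/* Return the value of environment variable `name`,
 * or NULL if it is unset or empty. */
const char *nonempty_env(const char *name) {
    assert(name != NULL);
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/* Return the locale name that applies to one category,
 * e.g. category_var = "LC_TIME" or "LC_NUMERIC". */
const char *effective_locale_name(const char *category_var) {
    const char *value;
    if ((value = nonempty_env("LC_ALL")) != NULL)
        return value; /* overrides every category */
    if ((value = nonempty_env(category_var)) != NULL)
        return value; /* per-category setting */
    if ((value = nonempty_env("LANG")) != NULL)
        return value; /* default for all categories */
    return "C";       /* assumed fallback when nothing is set */
}
```

For example, with LANG=ja_JP.UTF-8 and LC_TIME=nl_NL.UTF-8, `effective_locale_name("LC_TIME")` yields nl_NL.UTF-8 while `effective_locale_name("LC_NUMERIC")` yields ja_JP.UTF-8. This also shows why exporting LC_ALL, as in the .zshrc above, makes LANG and the other LC_* variables irrelevant.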

This discussion was first of all about SWI-Prolog.app, the binary distribution you can download from the site. Where Terminal.app sets up an environment that allows applications that rely on the C runtime library to work out of the box (often, not always :frowning: ), a macOS app expects you to use Apple’s infrastructure. Thanks to @dgelessus we now mimic the work Terminal.app does, such that the SWI-Prolog app should get its internationalization support right when possible, without user intervention.