Unicode symbols, and a possible numerical paradox!

You can do it in Prolog itself, here is a UTF-32 to UTF-16 converter:

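% convert_list(+CodePoints)//: emit the UTF-16 code units for a list of code points.
convert_list([]) --> [].
convert_list([X|Y]) --> convert_elem(X), convert_list(Y).

% BMP code points map to one code unit; anything above becomes a surrogate pair.
convert_elem(X) --> {X =< 0xFFFF}, !, [X].
convert_elem(X) --> {high_surrogate(X, Y), low_surrogate(X, Z)}, [Y, Z].

% 0xD7C0 is 0xD800 - (0x10000 >> 10), so the 0x10000 offset is folded in.
high_surrogate(X, Y) :- Y is (X>>10) + 0xD7C0.
low_surrogate(X, Y) :- Y is (X /\ 0x3FF) + 0xDC00.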

Works fine in WSL:

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)

?- set_prolog_flag(double_quotes, codes).
true.

?- X = "Flower 💐", convert_list(X, Y, []).
X = [70, 108, 111, 119, 101, 114, 32, 128144],
Y = [70, 108, 111, 119, 101, 114, 32, 55357, 56464].
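
The opposite direction can be sketched in the same style. This is only a minimal sketch: utf16_codes//1 and utf16_code//1 are hypothetical names, not SWI-Prolog built-ins, and unpaired surrogates are simply passed through unchanged.

```prolog
% Hypothetical inverse of convert_list//1: parse a list of UTF-16 code
% units and collect the corresponding code points.
utf16_codes([C|Cs]) --> utf16_code(C), !, utf16_codes(Cs).
utf16_codes([]) --> [].

% A high surrogate followed by a low surrogate combines into one code point.
utf16_code(C) --> [H, L],
    { H >= 0xD800, H =< 0xDBFF, L >= 0xDC00, L =< 0xDFFF }, !,
    { C is ((H - 0xD800) << 10) + (L - 0xDC00) + 0x10000 }.
% Everything else (including an unpaired surrogate) is taken as is.
utf16_code(C) --> [C].
```

With this, ?- phrase(utf16_codes(Cs), [70, 55357, 56464]). should give Cs = [70, 128144], i.e. the flower comes back as a single code point.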

But I guess the Windows platform and its Wxxx APIs are not
the problem. I rather find the Unix implementation of SWI-Prolog
problematic. Testing again in WSL, this doesn’t give true:

?- X = 'Flower 💐', Y = 'Flower \xd83d\\xdc90\', X == Y.
false.

If your system works with UTF-16 strings or with CESU-8 strings
(a UTF-8 variant that allows surrogate pairs), the above might give true.
On the recent Windows version of SWI-Prolog, on the other hand, the
corresponding query works. I suspect this is because the command line
delivers UTF-16. I get this here:

?- X = 'Flower 💐', Y = 'Flower \ud83d\udc90', X == Y.
X = Y, Y = 'Flower \uD83D\uDC90'.

Here is a screenshot as a proof:

But there is still an annoyance. With the recent version of SWI-Prolog
on Windows 10, I get these \u escapes in the quoted string, because
surrogate pairs are not understood:

?- X = 'Flower 💐', write(X), nl.
Flower 💐
X = 'Flower \uD83D\uDC90'.

In Dogelog Player I get the following result in the browser, no escapes
in the string, because the flower Unicode code point is judged as printable:

?- X = 'Flower 💐', write(X), nl.
Flower 💐
X = 'Flower 💐'.

This is related to this question here:

Print vs quoted output of Unicode characters
https://swi-prolog.discourse.group/t/print-vs-quoted-output-of-unicode-characters/5546/2

On a Unix system the console produces a multibyte string in the current locale encoding, these days almost invariably UTF-8. That is read and represented as a wchar_t array, which is effectively UCS-4 on anything but Windows. An atom is a sequence of Unicode code points, and thus the query boils down to

?- X = 'Flower \U0001f490', Y = 'Flower \xd83d\\xdc90\', X == Y.

On Windows the story is different because wchar_t is 2 bytes, using UTF-16 encoding. SWI-Prolog doesn’t know about this and interprets it as UCS-2, treating the surrogate code units as reserved/unassigned characters.

It might make sense to stop escaping the surrogate pairs. It is still fundamentally wrong though: an atom should logically be a sequence of code points, not some internal encoding. It is not really clear whether no longer escaping surrogate pairs makes things better. It looks better, but hides the internal problem.

The real solution is to get rid of the UCS-2 on Windows. There are two obvious routes for that: go UTF-8 on all platforms internally or go UCS-1+UCS-4 as we use on Unix also on Windows. For most of the code using UTF-8 is probably the simplest way out and, as it makes all platforms internally the same, probably also the safest route. It also seems to be the choice of most new languages/implementations. Luckily SWI-Prolog has always hidden encoding from the user, so we can change it without affecting applications.

The main problem is predicates such as sub_atom/5. These get seriously slower on long atoms/strings, and more complicated. Using UTF-16 on Windows is another route, introducing the same complications for sub_atom/5 and other routines that handle strings as an array of code points.
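
To illustrate why indexing gets more complicated with a variable-width encoding, here is a minimal sketch in Prolog. The names utf16_nth0/3 and utf16_next/3 are hypothetical and have nothing to do with the actual C implementation of sub_atom/5. Prolog lists are linked lists anyway, so this only shows the counting logic: a surrogate pair occupies two code units but counts as one position, so the n-th code point cannot be found by direct offset and the units have to be walked from the start.

```prolog
% Hypothetical helper: Code is the code point at 0-based index I in a
% list of UTF-16 code units. Fails if I is out of range.
utf16_nth0(I, Units, Code) :-
    I >= 0,
    utf16_next(Units, C, Rest),
    (   I =:= 0
    ->  Code = C
    ;   I1 is I - 1,
        utf16_nth0(I1, Rest, Code)
    ).

% Take one code point off the front: either a surrogate pair (two units)
% or a single unit.
utf16_next([H, L|Rest], C, Rest) :-
    H >= 0xD800, H =< 0xDBFF,
    L >= 0xDC00, L =< 0xDFFF, !,
    C is ((H - 0xD800) << 10) + (L - 0xDC00) + 0x10000.
utf16_next([C|Rest], C, Rest).
```

For example, in [55357, 56464, 70] the code point at index 1 is 70, although it sits at unit offset 2.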

The notion of “character codes” is also affected if you
broaden the strings/atoms.

Do you have a different code_type/2 implementation on Windows
and non-Windows platforms? I get the following; I was trying the
new umap that can do digit:

/* Windows 10, SWIPL 8.5.13 */
?- code_type(0x1f490, X).
X = to_lower(62608) ;
X = to_upper(62608) ;
false.

/* WSL, SWIPL 7.6.4 */
?- code_type(0x1f490, X).
X = prolog_symbol ;
X = graph ;
X = punct ;
X = to_lower(128144) ;
X = to_upper(128144) ;
false.

Does it silently take only the lower 16 bits of the given
code value on Windows? It seems so:

?- X = 0xf490.
X = 62608.
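
The same check can be written as plain arithmetic. This is just a sketch of the suspected truncation: truncate16/2 is a hypothetical helper, and whether SWI-Prolog or the C runtime actually masks the value this way is an assumption.

```prolog
% Hypothetical helper mimicking a 16-bit character type: keep only the
% low 16 bits of a code point.
truncate16(Code, Low) :- Low is Code /\ 0xFFFF.
```

?- truncate16(0x1f490, X). gives X = 62608, the same number that shows up in the to_lower/to_upper answers on Windows.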

That made me wonder. The case conversion uses towupper() and towlower(). These take wint_t input and output. Now, on Windows sizeof(wint_t) is guess guess guess … yes! 2 bytes! That is the case for all the isw* and tow* functions. So, none of the Windows C runtime wide character routines can reason about code points > 0xffff. Here surrogate pairs are not going to help :frowning:

The native Windows API provides CharLowerW() which takes a wchar_t*, with the very strange semantics that if the pointer is <= 0xffff the argument is taken to be a character, returning a “pointer” that actually contains a code point … That doesn’t help either. Maybe it works correctly if you actually use the string API and give it a surrogate pair? The docs do not give a hint and seem to suggest Microsoft is not aware of surrogate pairs. Well, I know it is not that bad.

There is more wrong with Windows than one would hope :frowning:

\documentclass[12pt]{article}

\usepackage{xltxtra}
\usepackage[none]{hyphenat}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{amssymb}
\usepackage{polyglossia}
 \setmainlanguage{english}
 \setotherlanguages{sanskrit}
 \newfontfamily\devanagarifont[Script=Devanagari]{Noto Serif Devanagari}

\newcommand{\zdn}[1]{\begin{sanskrit}#1\end{sanskrit}}
\newcommand{\zcode}[1]{\textasciigrave{}#1\textasciigrave{}}
\newcommand{\crossschwa}[1]{\lower.05cm\hbox{\rotatebox{45}{\begin{tikzpicture}[#1]
 \draw (-1,1) -- (-1,0) -- (1,0) -- (1,-1);
 \draw (-1,-1) -- (0,-1) -- (0,1) -- (1,1);
 \fill (-0.5,-0.5) circle[radius=2.25pt];
 \fill (-0.5,0.5) circle[radius=2.25pt];
 \fill (0.5,-0.5) circle[radius=2.25pt];
 \fill (0.5,0.5) circle[radius=2.25pt];
\end{tikzpicture}}}} %https://tex.stackexchange.com/q/263269

\title{issue(115).}
\author{\crossschwa{scale=0.16} \zdn{\char"0936\char"094D\char"092F\char"093E\char"092E} \crossschwa{scale=0.16}}
\date{HE 12022.06.28}

\begin{document}
\maketitle
\abstract{\zdn{\char"0950} In this \XeLaTeX{} document, I present my work in brainstorming how Prolog handles Indic scripts; for a second time, in response to ``the daily build swipl-w64-2022-06-24.exe''. Noto Serif Devanagari is the best font in the Raspberry Pi repos. All content used for academic purposes.}
\tableofcontents

\section{I remain skeptical \#commentreasure} % @the_speed_of_light @lightspeed @ftl}
%                                           ^^^^^                   ^           ^   ^
%                                           │││││                   │           │   └ some incursive recursive madness i'll take care of later; make a date 'n' every"-thing" self settles
%                                           │││││                   └───────────┴ my future selves's optimised models
%                                           ││││└ my derelict dialect; codenamed "mirage", pronounced "मी रैज् ॥"
%                                           ││└┼ me cast adrift; "are we there yet?"@log,_like_a_pro;_err…
%                                           │└─┴ space
%                                           └ airlock
This seems shallow.
What happens for \texttt{`is/2`}?
I can't imagine anyone wanting to input one script, and outputting another; from a hacking perspective.
My RFC was they should be equal (\texttt{`==/2`}), but should not unify (\texttt{`=/2`}); now I realise this is a deeper issue, but the (more) informative error message from my RFC still stands, or the inconsistency from the prior section looks buggy, but Perl taught me bugs are byproducts of features, and now I've adapted this philosophy to Prolog.

\subsection{Compromisation}
My compromise is to just use Prolog the same as before I started learning Sanskrit, and do some"-thing" along the lines of this:
\\\texttt{devnum(D,N) :- ground(N), nth0(N,[}\zdn{०}\texttt{,}\zdn{१}\texttt{,}\zdn{२}\texttt{,}\zdn{३}\texttt{,}\zdn{४}\texttt{,}\zdn{५}\texttt{,}\zdn{६}\texttt{,}\zdn{७}\texttt{,}\zdn{८}\texttt{,}\zdn{९}\texttt{],D). \%commentreasure} %D never needs grounding, since D becomes N through unification; I'm not convinced this will even work as intended, but was planning on using atom_chars(Atomic_D,Lithp_D)

\section{I reject your' reality, and substitute my own!!! :D \#eject\_2:\_infinity,\_and\_beyond!!!\_:D}
I'm going to hack my own self-hosted dialect; reinventing each, and every, (key) component part of each, and every, wheel, while inventing some new ones, and \texttt{s/getting/setting/} the time right (this time), whilst I am, locatively, at it; future proofing it, so it is (well, and truly) ``good for over a hundred years'', or even ``good forever''; even after the heat death of the multiverse, and into the following cyclic multiverses.

\end{document}