Unicode symbols, and a possible numerical paradox!

In general, it’s a mistake to depend on language features for normalizing data (even when setting the localisation environment). I would expect a Japanese-oriented application to normalize full-width numbers to ASCII (e.g., １２３ would be normalized to 123) and also to normalize 半角 (han-kaku, half-width) forms to 全角 (zen-kaku, full-width) forms (e.g., ﾊﾝｶｸ to ハンカク), and probably to apply a lot of other normalizations. Presumably there are libraries for handling this. (Normalization rules for other languages have their own nuances; for example, Québec and France have slightly different rules for capitalizing a name with accents, such as “Lenôtre”.)
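Several of these normalizations are in fact standardized: Unicode's NFKC compatibility normalization folds full-width ASCII variants to plain ASCII and half-width katakana to full-width katakana. A minimal sketch in Python (a real Japanese-oriented application would likely need further, locale-specific rules beyond NFKC):

```python
import unicodedata

# NFKC compatibility normalization folds full-width digits ("１２３")
# to ASCII ("123") and half-width katakana ("ﾊﾝｶｸ") to full-width
# katakana ("ハンカク") in one pass.
print(unicodedata.normalize("NFKC", "１２３"))  # -> 123
print(unicodedata.normalize("NFKC", "ﾊﾝｶｸ"))   # -> ハンカク
```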

I fully agree with your suggestion. In fact, I have been practising that way for a long time, and I believe most people do the same. For example, network administrators widely advised against using half-width characters in mail subject lines, because of the high risk of trouble when sending or receiving such character sequences.
There may be many R&D topics related to encoding that are still genuinely active, but I stayed away from such important and heavy issues; I was only close to them in the sense of “social distance”. Character encoding never looked attractive to me, but it may be a more important technology than I thought. Metafont technology might concern glyphs, for example.

TextEdit seems to open both the UTF-8 and UTF-16 files without warnings, as you said. Thanks.

% echo  漢字 > a 
% nkf -w16 a > b
% open -a TextEdit a
% open -a TextEdit b

Do you see what you want? I see something without meaning :sweat_smile:

% hexdump a
0000000 bce6 e5a2 97ad 000a                    
0000007

% hexdump b
0000000 fffe 226f 575b 0a00                    
0000008
%
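The bytes themselves look right; plain `hexdump` just prints 16-bit little-endian words, so consecutive bytes appear swapped in pairs. File `a` is the UTF-8 bytes of 漢字 plus a newline, and `nkf -w16` produced UTF-16 big-endian with a BOM. A small Python sketch reconstructing both byte sequences:

```python
# File "a": UTF-8.  hexdump's word "bce6" is really the bytes e6 bc.
utf8 = "漢字\n".encode("utf-8")
print(utf8.hex())                    # e6bca2e5ad970a

# File "b": nkf -w16 writes UTF-16 big-endian, prefixed with the
# byte order mark FE FF (hexdump shows it as the word "fffe").
utf16 = b"\xfe\xff" + "漢字\n".encode("utf-16-be")
print(utf16.hex())                   # feff6f225b57000a
```

So 漢 is U+6F22 and 字 is U+5B57, exactly the words `226f` and `575b` in the byte-swapped hexdump output.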

Roughly, that would require making something like libiconv or librecode part of the system. While one of these is typically already installed on most Unix-like systems, it is a big new dependency, about as big as SWI-Prolog itself :frowning:

When I added Unicode support I went for built-in support for ISO Latin 1, UTF-8 and UCS-2 (big and little endian). In addition there is the text encoding, which uses whatever the locale tells the multibyte routines of the C runtime to support. So, you always have access to files in your default locale, UTF-8 and UCS-2. As virtually all non-Windows systems already use UTF-8 as their primary encoding, this seems the right direction. Windows seems to be moving in the same direction.

At some point the UCS-2 support should be extended to UTF-16, i.e., we must support surrogate pairs. As is, this is not that useful: we find UTF-16 mostly on Windows, and SWI-Prolog does not support code points > 0xffff on Windows anyway.

On a Unix system the console produces a multibyte string in the current locale encoding, these days almost invariably UTF-8. That is read and represented as a wchar_t array, which is effectively UCS-4 on anything but Windows. An atom is a sequence of Unicode code points, and thus this query boils down to

?- X = 'Flower \U0001f490', Y = 'Flower \xd83d\\xdc90\', X == Y.

On Windows the story is different because wchar_t is 2 bytes, using the UTF-16 encoding. SWI-Prolog doesn’t know about this and interprets it as UCS-2, treating the surrogate code points as reserved/unassigned characters.
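The `\xd83d\` and `\xdc90\` escapes in the query above are exactly the UTF-16 surrogate pair for U+1F490. The arithmetic behind that split is simple; a minimal sketch:

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point (> 0xFFFF) into its
    UTF-16 high/low surrogate pair."""
    assert cp > 0xFFFF
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

hi, lo = to_surrogates(0x1F490)      # U+1F490, the flower emoji
print(hex(hi), hex(lo))              # 0xd83d 0xdc90
```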

It might make sense to stop escaping the surrogate pairs, but that is still fundamentally wrong: an atom should logically be a sequence of code points, not some internal encoding. It is not really clear whether no longer escaping surrogate pairs makes things better. It looks better, but hides the internal problem.

The real solution is to get rid of the UCS-2 on Windows. There are two obvious routes for that: go UTF-8 on all platforms internally or go UCS-1+UCS-4 as we use on Unix also on Windows. For most of the code using UTF-8 is probably the simplest way out and, as it makes all platforms internally the same, probably also the safest route. It also seems to be the choice of most new languages/implementations. Luckily SWI-Prolog has always hidden encoding from the user, so we can change it without affecting applications.

The main problem is predicates such as sub_atom/5: these get seriously slower on long atoms/strings, and more complicated. Using UTF-16 on Windows is another route, but it introduces the same complications for sub_atom/5 and other routines that handle strings as an array of code points.
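The slowdown comes from indexing: with a fixed-width representation, the Nth code point is an O(1) array access, while with internal UTF-8 it requires scanning from the start. A hypothetical illustration over raw UTF-8 bytes (the helper name is invented here, not SWI-Prolog internals):

```python
def cp_index_to_byte_offset(buf, n):
    """Byte offset of the n-th code point in UTF-8 bytes `buf`.
    An O(n) scan, unlike O(1) indexing in a fixed-width encoding."""
    off = 0
    for _ in range(n):
        off += 1
        # Skip UTF-8 continuation bytes (pattern 10xxxxxx).
        while off < len(buf) and buf[off] & 0xC0 == 0x80:
            off += 1
    return off

s = "Flower 💐!".encode("utf-8")
# Code point 8 is "!", which sits after the 4-byte emoji: byte offset 11.
print(cp_index_to_byte_offset(s, 8))  # -> 11
```

This is why sub_atom/5 and friends, which take and return code-point positions, get slower on long atoms under a variable-width internal encoding.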

That made me wonder. The case conversion uses towupper() and towlower(). These take wint_t input and output. Now, on Windows sizeof(wint_t) is guess guess guess … yes! 2 bytes! That is the case for all the isw* and tow* functions. So, none of the Windows C runtime wide character routines can reason about code points > 0xffff. Here surrogate pairs are not going to help :frowning:
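Such case mappings outside the BMP really exist, so this is not a theoretical limitation. A quick check in Python, whose str.lower() applies the full Unicode case mapping, using the Deseret alphabet as an example:

```python
# DESERET CAPITAL LETTER LONG I (U+10400) lowercases to U+10428.
# A 16-bit wint_t cannot even represent these code points, so the
# C runtime's towlower() on Windows cannot perform this mapping.
upper = "\U00010400"
lower = upper.lower()
print(hex(ord(lower)))   # 0x10428
```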

The native Windows API provides CharLowerW() which takes a wchar_t*, with the very strange semantics that if the pointer is <= 0xffff the argument is taken to be a character, returning a “pointer” that actually contains a code point … That doesn’t help either. Maybe it works correctly if you actually use the string API and give it a surrogate pair? The docs do not give a hint and seem to suggest Microsoft is not aware of surrogate pairs. Well, I know it is not that bad.

There is more wrong with Windows than one would hope :frowning:

\documentclass[12pt]{article}

\usepackage{xltxtra}
\usepackage[none]{hyphenat}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{amssymb}
\usepackage{polyglossia}
 \setmainlanguage{english}
 \setotherlanguages{sanskrit}
 \newfontfamily\devanagarifont[Script=Devanagari]{Noto Serif Devanagari}

\newcommand{\zdn}[1]{\begin{sanskrit}#1\end{sanskrit}}
\newcommand{\zcode}[1]{\textasciigrave{}#1\textasciigrave{}}
\newcommand{\crossschwa}[1]{\lower.05cm\hbox{\rotatebox{45}{\begin{tikzpicture}[#1]
 \draw (-1,1) -- (-1,0) -- (1,0) -- (1,-1);
 \draw (-1,-1) -- (0,-1) -- (0,1) -- (1,1);
 \fill (-0.5,-0.5) circle[radius=2.25pt];
 \fill (-0.5,0.5) circle[radius=2.25pt];
 \fill (0.5,-0.5) circle[radius=2.25pt];
 \fill (0.5,0.5) circle[radius=2.25pt];
\end{tikzpicture}}}} %https://tex.stackexchange.com/q/263269

\title{issue(115).}
\author{\crossschwa{scale=0.16} \zdn{\char"0936\char"094D\char"092F\char"093E\char"092E} \crossschwa{scale=0.16}}
\date{HE 12022.06.28}

\begin{document}
\maketitle
\abstract{\zdn{\char"0950} In this \XeLaTeX{} document, I present my work in brainstorming how Prolog handles Indic scripts; for a second time, in response to ``the daily build swipl-w64-2022-06-24.exe''. Noto Serif Devanagari is the best font in the Raspberry Pi repos. All content used for academic purposes.}
\tableofcontents

\section{I remain skeptical \#commentreasure} % @the_speed_of_light @lightspeed @ftl}
%                                           ^^^^^                   ^           ^   ^
%                                           │││││                   │           │   └ some incursive recursive madness i'll take care of later; make a date 'n' every"-thing" self settles
%                                           │││││                   └───────────┴ my future selves's optimised models
%                                           ││││└ my derelict dialect; codenamed "mirage", pronounced "मी रैज् ॥"
%                                           ││└┼ me cast adrift; "are we there yet?"@log,_like_a_pro;_err…
%                                           │└─┴ space
%                                           └ airlock
This seems shallow.
What happens for \zcode{is/2}?
I can't imagine anyone wanting to input one script, and outputting another; from a hacking perspective.
My RFC was they should be equal (\zcode{==/2}), but should not unify (\zcode{=/2}); now I realise this is a deeper issue, but the (more) informative error message from my RFC still stands, or the inconsistency from the prior section looks buggy, but Perl taught me bugs are byproducts of features, and now I've adopted this philosophy for Prolog.

\subsection{Compromisation}
My compromise is to just use Prolog the same as before I started learning Sanskrit, and do some"-thing" along the lines of this:
\\\texttt{devnum(D,N) :- ground(N), nth0(N,[}\zdn{०}\texttt{,}\zdn{१}\texttt{,}\zdn{२}\texttt{,}\zdn{३}\texttt{,}\zdn{४}\texttt{,}\zdn{५}\texttt{,}\zdn{६}\texttt{,}\zdn{७}\texttt{,}\zdn{८}\texttt{,}\zdn{९}\texttt{],D). \%commentreasure} %D never needs grounding, since D becomes N through unification; I'm not convinced this will even work as intended, but was planning on using atom_chars(Atomic_D,Lithp_D)

\section{I reject your reality, and substitute my own!!! :D \#eject\_2:\_infinity,\_and\_beyond!!!\_:D}
I'm going to hack my own self-hosted dialect; reinventing each, and every, (key) component part of each, and every, wheel, while inventing some new ones, and \texttt{s/getting/setting/} the time right (this time), whilst I am, locatively, at it; future proofing it, so it is (well, and truly) ``good for over a hundred years'', or even ``good forever''; even after the heat death of the multiverse, and into the following cyclic multiverses.

\end{document}