Unicode symbols, and a possible numerical paradox!

I agree. But office workers might think differently, since many of them are now busy moving toward paperless offices.

BTW, researchers in India recently expressed thanks to Japanese temples, which still use the original Sanskrit letters (梵字?) for some purposes; those letters were lost in India. One Japanese tradition, for good or bad, is to keep imported things as they are and revise them only a little, out of respect. I have heard that many Chinese kanji words for modern concepts (60%?) have been coined by the Japanese since the Meiji era (1870).

I hope this perhaps good tradition does not become an obstacle to shifting to a paperless society. This is only my opinion; I have never discussed it with others, who may of course think differently.

It's not necessary to convert anything to UTF-8 if you have only a single script, aka language, inside your text, since most browsers understand, for example, Shift-JIS. If you encode the following into a Shift-JIS file:

<html>
   <head>
       <meta http-equiv="content-type" content="text/html; charset=Shift_JIS">
       <title></title>
   </head>
   <body>
     幽霊です
  </body>
</html>

It shows nicely in the browser. Although the browser's internal character set is usually Unicode (in particular, the browser's JavaScript strings are UTF-16), it nevertheless displays correctly, since the charset is just a property of the file.


That's the file I used:
test.html.log (243 Bytes)

Edit 24.06.2022:
Externally you might also prefer UTF-16 over UTF-8, since UTF-16 is said to be “Bad for English text, good for Asian text.”


The source of this claim (I would rather prefer some scientific statistics, but anyway):

XML 1.1 Bible, 3rd Edition
https://www.wiley.com/en-us/XML+1+1+Bible%2C+3rd+Edition-p-9780764549861

I prefer UTF-8 to UTF-16 even without such statistics. Perhaps I do so because Apple supports UTF-8.

BTW, the kanji 熊 (bear) is a trauma for me. About 30 years ago I spent a long time debugging Prolog code that simply read a file, and finally found that the kanji's byte sequence contained an ASCII control code, which stopped the program. It was EUC or Shift-JIS encoding, I don't remember which. I had never imagined that such errors could be caused by the text encoding. Then I heard that UTF-8 has no such problem, which is a simple but main reason why I prefer UTF-8.
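
For what it's worth, that property of UTF-8 is easy to check. A minimal sketch, assuming SWI-Prolog and its library(utf8): every byte of a multi-byte UTF-8 sequence is >= 0x80, so no ASCII control code can hide inside a character such as 熊.

?- use_module(library(utf8)),
   atom_codes('熊', [Code]),
   phrase(utf8_codes([Code]), Bytes),
   format("U+~16r encodes as bytes ~w~n", [Code, Bytes]).
% Expected: U+718a encodes as bytes [231,134,138], i.e. 0xE7 0x86 0x8A,
% all of them >= 0x80.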

Maybe you mean the locale and not the file format? A UTF-8 based locale is again something else. You can have a UTF-8 based locale and yet use a variety of file encodings on most if not all computers. This also seems to apply to the Mac: you can store UTF-16 or Shift-JIS files.

UTF-16 and UTF-8 store the same Unicode code points; you won't see any difference, only a different file size. Try it!

Disclaimer: I don't know your tool chains, so my advice might not work; be careful and make backups before converting stuff.

Edit 24.06.2022:
Here is a test: the plain text from the Japanese Wikipedia page on the Prolog programming language. The size reduction doesn't work so well for the HTML itself (all the markup is ASCII), which is why UTF-8 might be prevalent on the web.

But for the plain text one quickly sees a size reduction. The Wikipedia page also has some Latin text in it; maybe a page with only Japanese would show a higher reduction. Here are the results:

UTF-8 encoded: 38’191 Bytes
UTF-16 encoded: 29’392 Bytes

These are the files I was using:

prolog.jp.utf8.txt.log (37,3 KB)

prolog.jp.utf16.txt.log (28,7 KB)
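
For anyone who wants to reproduce such a comparison, here is a minimal sketch, assuming SWI-Prolog; the file names and the compare_sizes/1 predicate are made up for illustration:

% Write Text once as UTF-8 and once as UTF-16LE (with BOM), then compare sizes.
compare_sizes(Text) :-
    setup_call_cleanup(
        open('sample.utf8.txt', write, S8, [encoding(utf8)]),
        write(S8, Text),
        close(S8)),
    setup_call_cleanup(
        open('sample.utf16.txt', write, S16, [encoding(unicode_le), bom(true)]),
        write(S16, Text),
        close(S16)),
    size_file('sample.utf8.txt', N8),
    size_file('sample.utf16.txt', N16),
    format("UTF-8: ~d bytes, UTF-16: ~d bytes~n", [N8, N16]).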

Thanks for the useful comments. By saying that Apple supports UTF-8, I meant that Apple's default encoding is UTF-8. By "default encoding" I meant that Apple products, e.g. TextEdit.app, accept a UTF-8 file, but I suspect they do not accept a UTF-16 file without showing a dialog for changing the encoding. I have not tested this; it is based on my experience with macOS. Unfortunately, I have no occasion to touch Windows. Also, by default, text files are UTF-8 on iOS, iPadOS, and macOS. I guess this is Apple's policy for its recent universal-access functionality.

In general, it's a mistake to depend on language features for normalizing data (even when setting the localisation environment). I would expect a Japanese-oriented application to normalize full-width numbers to ASCII (e.g., １２３ would be normalized to 123) and also to normalize 半角 (han-kaku, half-width) to 全角 (zen-kaku, full-width) characters (e.g., ﾊﾝｶｸ to ハンカク), and probably a lot of other normalizations. Presumably there are libraries for handling this. (Normalization rules for other languages have their own nuances; for example Québec and France have slightly different rules for capitalizing a name with accents, such as “Lenôtre”.)
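
As a minimal sketch of the digit case, assuming SWI-Prolog (the predicate names are made up; a real application would rather use a proper Unicode normalization such as NFKC or a dedicated library):

% Map full-width digits (U+FF10..U+FF19) to ASCII by subtracting 0xFEE0.
normalize_digits(Atom, Normalized) :-
    atom_codes(Atom, Codes),
    maplist(normalize_digit, Codes, Codes2),
    atom_codes(Normalized, Codes2).

normalize_digit(C, D) :-
    C >= 0xFF10, C =< 0xFF19, !,
    D is C - 0xFEE0.
normalize_digit(C, C).

% ?- normalize_digits('１２３', X).
% X = '123'.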

I fully agree with your suggestion. In fact, I have been practicing that way for a long time, and I believe most people do so as well. For example, network administrators widely advised against using half-width characters in mail subject lines, because of the high risk of trouble when sending or receiving such character sequences.
There might be many R&D topics related to encoding that are still really active. However, I escaped from such important and heavy issues, or at least I hope so, since I was close to them only in the sense of "social distance". Character encoding did not look attractive to me, but now it may be a more important technology than I thought. Metafont technology might concern glyphs, for example.

Nope, it will accept a UTF-16 file without any manipulation, and you don't need to change the encoding to UTF-8. This works especially well when the UTF-16 file has a BOM as its first character. The BOM is then used to detect the particular type of encoding, for example the endianness. The BOM should also be supported by SWI-Prolog, from within the open/4 predicate. There are a couple of other Prolog systems that support it as well. See also here:

2.19.1.1 BOM: Byte Order Mark
BOM or Byte Order Marker is a technique for identifying
Unicode text files as well as the encoding they use.
SWI-Prolog -- Manual

bom(Bool)
the default is true for mode read
https://www.swi-prolog.org/pldoc/doc_for?object=open/4

Disclaimer: Of course you might run into problems if some tool in your tool
chain cannot detect, handle or generate BOMs, so be careful.
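
A minimal sketch, assuming SWI-Prolog (the file name is illustrative): in read mode open/4 checks for a BOM by default and adjusts the stream encoding, which stream_property/2 can report.

?- open('sample.txt', read, S, [bom(true)]),
   stream_property(S, encoding(Enc)),
   close(S),
   format("detected encoding: ~w~n", [Enc]).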

Edit 25.06.2022:
The BOM itself is just a Unicode code point that gets different byte patterns depending on the encoding. The code point even has an entry in the Unicode database:

Zero Width No-Break Space (BOM, ZWNBSP)
https://www.compart.com/de/unicode/U+FEFF

I even read now:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

So you might find a BOM also in UTF-8 files, not only in UTF-16 and other encodings. The usual heuristic is to assume UTF-8 if there is no BOM, but the above suggests throwing an error if there is no BOM and some non-ASCII content, i.e. characters beyond 7 bits.

I don't know whether any Prolog system throws such an error. One could make the input stream aware that there was no BOM, and then bark on a > 7-bit character coming from the stream. That could give more security.
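
Such a policy could also be approximated in user code. A minimal sketch, assuming SWI-Prolog (the predicate name and the error term are made up): read the raw bytes, accept UTF-8 only when a UTF-8 BOM is present, and otherwise complain about any byte above 127.

read_ascii_unless_bom(File, Codes) :-
    read_file_to_codes(File, Bytes, [encoding(octet)]),
    (   Bytes = [0xEF, 0xBB, 0xBF|_]              % UTF-8 BOM present
    ->  read_file_to_codes(File, Codes, [encoding(utf8)])
    ;   member(B, Bytes), B > 127                 % non-ASCII without BOM
    ->  throw(error(representation_error(non_ascii_without_bom), _))
    ;   Codes = Bytes
    ).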

This makes me curious to check what Java, JavaScript and Python do, whether they have some reader object with such a feature. In the worst case one could realize this feature in the Prolog system's streams itself.

Again, bom(Bool) is not mentioned in the ISO Prolog core standard, but it already has some support across Prolog systems. For example, in formerly Jekejeke Prolog I have the same; I am not sure what newer Prolog systems such as Scryer Prolog provide. Need to check.

Edit 25.06.2022:
But I don't know what the Mac, i.e. Apple, does; the above is about Windows. This author also discusses the issue at length for the Mac.

Source:

Step into Xcode: MAC OS X Development
https://www.amazon.com/dp/0321334221

TextEdit seems to open both files, UTF-8 and UTF-16, without warnings, as you said. Thanks.

% echo  漢字 > a 
% nkf -w16 a > b
% open -a TextEdit a
% open -a TextEdit b

Do you see what you want? I see something without meaning. :sweat_smile:

% hexdump a
0000000 bce6 e5a2 97ad 000a                    
0000007

% hexdump b
0000000 fffe 226f 575b 0a00                    
0000008
%

Just look at the table of byte order marks by encoding (link below). Now we find:

File a: has no BOM; you created it with echo, and it contains plain UTF-8 bytes.

File b: has a BOM. Note that hexdump prints 16-bit words in little-endian order, so the leading word fffe corresponds to the file bytes FE FF, the UTF-16BE BOM.

https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
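
One can also inspect the leading bytes directly from Prolog. A minimal sketch, assuming SWI-Prolog (first_two_bytes/3 is made up for illustration):

first_two_bytes(File, B1, B2) :-
    setup_call_cleanup(
        open(File, read, S, [type(binary)]),
        ( get_byte(S, B1), get_byte(S, B2) ),
        close(S)).

% ?- first_two_bytes(b, B1, B2), format("~16r ~16r~n", [B1, B2]).
% For the file b generated above this prints the BOM bytes, fe ff.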


Roughly, that would require making something like libiconv or librecode part of the system. While this is typically already installed on most Unix-like systems, it is a big new dependency, about as big as SWI-Prolog itself :frowning:

When I added Unicode support I went for built-in support for ISO Latin 1, UTF-8 and UCS-2 (big and little endian). In addition there is the text encoding, which uses whatever the locale tells the multibyte routines of the C runtime to support. So, you always have access to files in your default locale, UTF-8 and UCS-2. As virtually all non-Windows systems already use UTF-8 as their primary encoding, this seems the right direction. Windows seems to be moving in the same direction.

At some point UCS-2 should be extended to UTF-16, i.e., we must support surrogate pairs. As is, this is not that useful, as we find UTF-16 mostly on Windows and SWI-Prolog does not support code points > 0xffff on Windows.

You can do it in Prolog itself; here is a UTF-32 to UTF-16 converter:

convert_list([]) --> [].
convert_list([X|Y]) --> convert_elem(X), convert_list(Y).

convert_elem(X) --> {X =< 0xFFFF}, !, [X].
convert_elem(X) --> {high_surrogate(X, Y), low_surrogate(X, Z)}, [Y, Z].

high_surrogate(X, Y) :- Y is X>>10+0xD7C0.
low_surrogate(X, Y) :- Y is X/\0x3FF+0xDC00.

Works fine in WSL:

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)

?- set_prolog_flag(double_quotes, codes).
true.

?- X = "Flower 💐", convert_list(X, Y, []).
X = [70, 108, 111, 119, 101, 114, 32, 128144],
Y = [70, 108, 111, 119, 101, 114, 32, 55357, 56464].

But I guess the Windows platform and its Wxxx APIs are not the problem. I rather find the Unix implementation of SWI-Prolog problematic. Testing again in WSL, this doesn't give true:

?- X = 'Flower 💐', Y = 'Flower \xd83d\\xdc90\', X == Y.
false.

If your system works with UTF-16 strings or with CESU-8 strings (a UTF-8 variant that allows surrogate pairs), the above might give true. On the recent Windows version of SWI-Prolog, on the other hand, the above query does work. I suspect this is because the command line seems to deliver UTF-16. I get this here:

?- X = 'Flower 💐', Y = 'Flower \ud83d\udc90', X == Y.
X = Y, Y = 'Flower \uD83D\uDC90'.


But there is still an annoyance involved. I get these \u escapes in the quoted string with the recent version of SWI-Prolog on Windows 10, because surrogate pairs are not understood:

?- X = 'Flower 💐', write(X), nl.
Flower 💐
X = 'Flower \uD83D\uDC90'.

In Dogelog Player I get the following result in the browser, with no escapes in the string, because the flower Unicode code point is judged as printable:

?- X = 'Flower 💐', write(X), nl.
Flower 💐
X = 'Flower 💐'.

This is related to this question here:

Print vs quoted output of Unicode characters
https://swi-prolog.discourse.group/t/print-vs-quoted-output-of-unicode-characters/5546/2

On a Unix system the console produces a multibyte string in the current locale encoding, these days almost invariably UTF-8. That is read and represented as a wchar_t array, which is effectively UCS-4 on anything but Windows. An atom is a sequence of Unicode code points, and thus the query boils down to

?- X = 'Flower \U0001f490', Y = 'Flower \xd83d\\xdc90\', X == Y.

On Windows the story is different because wchar_t is 2 bytes, using the UTF-16 encoding. SWI-Prolog doesn't know about this and interprets it as UCS-2, considering the surrogate pairs as reserved/unassigned characters.

It might make sense to stop escaping the surrogate pairs. It is still fundamentally wrong though: an atom should logically be a sequence of code points, not some internal encoding. It is not really clear whether no longer escaping surrogate pairs makes things better. It looks better, but hides the internal problem.

The real solution is to get rid of the UCS-2 on Windows. There are two obvious routes for that: go UTF-8 on all platforms internally or go UCS-1+UCS-4 as we use on Unix also on Windows. For most of the code using UTF-8 is probably the simplest way out and, as it makes all platforms internally the same, probably also the safest route. It also seems to be the choice of most new languages/implementations. Luckily SWI-Prolog has always hidden encoding from the user, so we can change it without affecting applications.

The main problem is predicates such as sub_atom/5. These get seriously slower on long atoms/strings and more complicated. Using UTF-16 on Windows is another route, introducing the same complications for sub_atom/5 and other routines that handle strings as an array of code points.
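
A minimal sketch of where the cost comes from: with a variable-width internal encoding such as UTF-8, finding the N-th code point means walking the bytes character by character, whereas with a fixed-width representation it is a single offset computation (utf8_skip/3 below is made up for illustration):

% Skip N code points in a list of UTF-8 bytes: O(N), not O(1).
utf8_skip(Bytes, 0, Bytes) :- !.
utf8_skip(Bytes, N, Rest) :-
    Bytes = [Lead|_],
    lead_width(Lead, W),
    length(Prefix, W),
    append(Prefix, Tail, Bytes),
    N1 is N - 1,
    utf8_skip(Tail, N1, Rest).

lead_width(B, 1) :- B < 0x80, !.     % ASCII
lead_width(B, 2) :- B < 0xE0, !.     % 2-byte sequence
lead_width(B, 3) :- B < 0xF0, !.     % 3-byte sequence
lead_width(_, 4).                    % 4-byte sequence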

The notion of “character codes” is also affected if you
broaden the strings/atoms.

Do you have a different code_type/2 implementation on Windows and non-Windows platforms? I get this here; I was trying the new umap that can do digit:

/* Windows 10, SWIPL 8.5.13 */
?- code_type(0x1f490, X).
X = to_lower(62608) ;
X = to_upper(62608) ;
false.

/* WSL, SWIPL 7.6.4 */
?- code_type(0x1f490, X).
X = prolog_symbol ;
X = graph ;
X = punct ;
X = to_lower(128144) ;
X = to_upper(128144) ;
false.

Does it silently take only the lower 16 bits of the given code value on Windows? It seems so:

?- X = 0xf490.
X = 62608.

That made me wonder. The case conversion uses towupper() and towlower(). These take wint_t input and output. Now, on Windows sizeof(wint_t) is guess guess guess … yes! 2 bytes! That is the case for all the isw* and tow* functions. So, none of the Windows C runtime wide character routines can reason about code points > 0xffff. Here surrogate pairs are not going to help :frowning:

The native Windows API provides CharLowerW() which takes a wchar_t*, with the very strange semantics that if the pointer is <= 0xffff the argument is taken to be a character, returning a “pointer” that actually contains a code point … That doesn’t help either. Maybe it works correctly if you actually use the string API and give it a surrogate pair? The docs do not give a hint and seem to suggest Microsoft is not aware of surrogate pairs. Well, I know it is not that bad.

There is more wrong with Windows than one would hope :frowning:

\documentclass[12pt]{article}

\usepackage{xltxtra}
\usepackage[none]{hyphenat}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{amssymb}
\usepackage{polyglossia}
 \setmainlanguage{english}
 \setotherlanguages{sanskrit}
 \newfontfamily\devanagarifont[Script=Devanagari]{Noto Serif Devanagari}

\newcommand{\zdn}[1]{\begin{sanskrit}#1\end{sanskrit}}
\newcommand{\zcode}[1]{\textasciigrave{}#1\textasciigrave{}}
\newcommand{\crossschwa}[1]{\lower.05cm\hbox{\rotatebox{45}{\begin{tikzpicture}[#1]
 \draw (-1,1) -- (-1,0) -- (1,0) -- (1,-1);
 \draw (-1,-1) -- (0,-1) -- (0,1) -- (1,1);
 \fill (-0.5,-0.5) circle[radius=2.25pt];
 \fill (-0.5,0.5) circle[radius=2.25pt];
 \fill (0.5,-0.5) circle[radius=2.25pt];
 \fill (0.5,0.5) circle[radius=2.25pt];
\end{tikzpicture}}}} %https://tex.stackexchange.com/q/263269

\title{issue(115).}
\author{\crossschwa{scale=0.16} \zdn{\char"0936\char"094D\char"092F\char"093E\char"092E} \crossschwa{scale=0.16}}
\date{HE 12022.06.28}

\begin{document}
\maketitle
\abstract{\zdn{\char"0950} In this \XeLaTeX{} document, I present my work in brainstorming how Prolog handles Indic scripts; for a second time, in response to ``the daily build swipl-w64-2022-06-24.exe''. Noto Serif Devanagari is the best font in the Raspberry Pi repos. All content used for academic purposes.}
\tableofcontents

\section{I remain skeptical \#commentreasure} % @the_speed_of_light @lightspeed @ftl}
%                                           ^^^^^                   ^           ^   ^
%                                           │││││                   │           │   └ some incursive recursive madness i'll take care of later; make a date 'n' every"-thing" self settles
%                                           │││││                   └───────────┴ my future selves's optimised models
%                                           ││││└ my derelict dialect; codenamed "mirage", pronounced "मी रैज् ॥"
%                                           ││└┼ me cast adrift; "are we there yet?"@log,_like_a_pro;_err…
%                                           │└─┴ space
%                                           └ airlock
This seems shallow.
What happens for \texttt{`is/2`}?
I can't imagine anyone wanting to input one script, and outputting another; from a hacking perspective.
My RFC was they should be equal (\texttt{`==/2`}), but should not unify (\texttt{`=/2`}); now I realise this is a deeper issue, but the (more) informative error message from my RFC still stands, or the inconsistency from the prior section looks buggy, but Perl taught me bugs are byproducts of features, and now I've adapted this philosophy to Prolog.

\subsection{Compromisation}
My compromise is to just use Prolog the same as before I started learning Sanskrit, and do some"-thing" along the lines of this:
\\\texttt{devnum(D,N) :- ground(N), nth0(N,[}\zdn{०}\texttt{,}\zdn{१}\texttt{,}\zdn{२}\texttt{,}\zdn{३}\texttt{,}\zdn{४}\texttt{,}\zdn{५}\texttt{,}\zdn{६}\texttt{,}\zdn{७}\texttt{,}\zdn{८}\texttt{,}\zdn{९}\texttt{],D). \%commentreasure} %D never needs grounding, since D becomes N through unification; I'm not convinced this will even work as intended, but was planning on using atom_chars(Atomic_D,Lithp_D)

\section{I reject your reality, and substitute my own!!! :D \#eject\_2:\_infinity,\_and\_beyond!!!\_:D}
I'm going to hack my own self-hosted dialect; reinventing each, and every, (key) component part of each, and every, wheel, while inventing some new ones, and \texttt{s/getting/setting/} the time right (this time), whilst I am, locatively, at it; future proofing it, so it is (well, and truly) ``good for over a hundred years'', or even ``good forever''; even after the heat death of the multiverse, and into the following cyclic multiverses.

\end{document}