Unicode symbols, and a possible numerical paradox!

A recent issue was opened on the SWI-Prolog GitHub discussions that should really be taking place here.



I came across a similar problem many years ago, so the reply by Jan W. should be useful to anyone in a similar situation. Granted, the situation may not appear to be the same when first encountered, but it comes up often enough to warrant a post here.

As the docs for char_type/2 tell you, the result depends on the output of the Unicode classification functions from <wctype.h>. So, the result may vary depending on the OS locale implementation and the selected locale. This is unlike Prolog’s read/1 implementation, which uses a table compiled from the official Unicode data (this needs to be redone, as it currently uses Unicode 6.0). These types are available from char_type/2 as the prolog_* types.
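For concreteness, a minimal illustration of the difference (a sketch, not verified output: the first answer depends on the OS and locale, which is exactly the point; prolog_identifier_continue is one of the documented prolog_* types):

?- char_type('१', digit(W)).                    % locale-dependent, via <wctype.h>
false.                                          % typical, but locales may differ

?- char_type('१', prolog_identifier_continue).  % driven by the Unicode table
true.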

That leaves two questions. Is it a good idea for read/1 to parse digits from other scripts as numbers? At the moment number_codes/2 and friends use the same code as read/1, which I think is demanded by the ISO standard. Second, is there a C(++) library for this? We probably have to do more work to deal with unbounded integers …

\documentclass[12pt]{article}

\usepackage{xltxtra}
\usepackage[none]{hyphenat}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{amssymb}
\usepackage{polyglossia}
 \setmainlanguage{english}
 \setotherlanguages{sanskrit}
 \newfontfamily\devanagarifont[Script=Devanagari]{Noto Serif Devanagari}

\newcommand{\zdn}[1]{\begin{sanskrit}#1\end{sanskrit}}
\newcommand{\zcode}[1]{\textasciigrave{}#1\textasciigrave{}}
\newcommand{\crossschwa}[1]{\lower.05cm\hbox{\rotatebox{45}{\begin{tikzpicture}[#1]
 \draw (-1,1) -- (-1,0) -- (1,0) -- (1,-1);
 \draw (-1,-1) -- (0,-1) -- (0,1) -- (1,1);
 \fill (-0.5,-0.5) circle[radius=2.25pt];
 \fill (-0.5,0.5) circle[radius=2.25pt];
 \fill (0.5,-0.5) circle[radius=2.25pt];
 \fill (0.5,0.5) circle[radius=2.25pt];
\end{tikzpicture}}}} % https://tex.stackexchange.com/q/263269

\title{issue(115).}
\author{\crossschwa{scale=0.16} \zdn{\char"0936\char"094D\char"092F\char"093E\char"092E} \crossschwa{scale=0.16}}
\date{HE 12022.06.23}

\begin{document}
\maketitle
\abstract{\zdn{\char"0950} In this \XeLaTeX{} document, I present my work in brainstorming how Prolog handles Indic scripts. Noto Serif Devanagari is the best font in the Raspberry Pi repos. All content used for academic purposes.}
\tableofcontents

\section{Unicode support in web browsers}
\zdn{𑥐𑥑𑥒𑥓𑥔𑥕𑥖𑥗𑥘𑥙} is the only one that Raspberry Pi doesn't support out of the box; tested in both Firefox, and Chromium.
Maybe they should also be tested in \XeLaTeX{}, but I'm not sure about Polyglossia, nor its Sanskrit environment, nor Devanagari fonts; I can't imagine the \href{https://ctan.org/pkg/sanskrit}{Sanskrit}, nor \href{https://ctan.org/pkg/devanagari}{Devanagari}, packages supporting Unicode either (I think they support Velthuis, or some``-thing'' similar).

\section{Choose one, and be consistent}
Standalone Indic numerals should match Indic numbers; \zcode{[\zdn{\char"0967},\zdn{\char"0968},\zdn{\char"0969}]} is legal, whilst \zcode{\zdn{\char"0967\char"0968\char"0969}} isn't; I can't imagine this discrepancy makes any sense.

\section{RFC 4: Hindu Numbers}
Until this has been implemented, a more informative message should indicate why an error has been thrown; \zcode{123 \textbackslash{}= \zdn{\char"0967\char"0968\char"0969}, 123 == \zdn{\char"0967\char"0968\char"0969}.} (Sometimes you just want to break all the rules!).
I'm assuming \texttt{=/2} is unification, and \texttt{==/2} is equality, but I can't remember; \texttt{=:=/2}, occurs check, and whatever else is out there… (I couldn't discover \texttt{\zcode{A = B, A \textbackslash{}== B; A \textbackslash{}= B, A == B.}}; its type must be uninhabited for Turing-complete Prolog, where Shyam-completeness can implement any``-thing'', and every``-thing'' $\therefore 0$ type inhabitance is a cake baked by \zdn{\char"092E\char"093E\char"092F\char"093E{}}, where \texttt{\zcode{Roman = māyā.}}.)

\section{Real \texttt{/deva?s/} use \textbf{Dev}anagari!!! :D}
Sanskrit is the language of the multiverse, where life is a game of snakes, and ladders; my avatar is in a winning position $\because$ I was a \zdn{\char"0926\char"0947\char"0935{}} (\texttt{\zcode{Roman = deva}}) in my life before this. \#checkmate

Iff you learn Sanskrit, you'll feel enlightened by ``wingardium leviosa''; swish, and flick, your tongue. \#magically\_and\_notnot\_technologically\_rhythmic\_advancements

\zdn{\char"092A\char"0941\char"0928\char"0930\char"094D{} \char"092E\char"093F\char"0932\char"093E\char"092E\char"0903 \char"0965{} \char"0967{} \char"0965{}} \#reincarnation

\end{document}

I’ve updated this a little, but we keep fairly limited information. From the generated src/pl-umap.c, we get this list; U_DECIMAL has just been added and deals with the Unicode category Nd. All details on how this is generated are in src/Unicode/prolog_syntax_map.pl, which generates src/pl-umap.c (and can also generate JavaScript to support SWISH; it has not been tested whether the updates still produce valid JavaScript).

#define U_ID_START           0x1
#define U_ID_CONTINUE        0x2
#define U_UPPERCASE          0x4
#define U_SEPARATOR          0x8
#define U_SYMBOL            0x10
#define U_OTHER             0x20
#define U_CONTROL           0x40
#define U_DECIMAL           0x80

These are accessible mostly through the char_type/2 types prolog_* and the (new) decimal. See also SWI-Prolog -- Unicode Prolog source.

This is now used in char_type/2, with a new type decimal (I didn’t want to touch digit), so we can do this to enumerate all decimals with weight 6:

?- char_type(C, decimal(6)).
C = '6' ;
C = '٦' ;
C = '۶' ;
C = '߆' ;
C = '६' ;
C = '৬' ;
C = '੬' ;
C = '૬' ;
C = '୬' ;
C = '௬' 
...

To get a nice list as you have:

?- code_type(S, decimal(0)), E is S+9, numlist(S,E,L), format('~s~n', [L]).
0123456789
S = 48,
E = 57,
L = [48, 49, 50, 51, 52, 53, 54, 55, 56|...] ;
٠١٢٣٤٥٦٧٨٩
S = 1632,
E = 1641,
L = [1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639, 1640|...] ;
۰۱۲۳۴۵۶۷۸۹
S = 1776,
E = 1785,
L = [1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784|...] ;
߀߁߂߃߄߅߆߇߈߉
S = 1984,
E = 1993,
L = [1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992|...] ;
०१२३४५६७८९
S = 2406,
E = 2415,
L = [2406, 2407, 2408, 2409, 2410, 2411, 2412, 2413, 2414|...] 

The parser is also extended to read integers and rational numbers from other scripts, so

?- A = ४२.
A = 42.
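Since number_codes/2 and friends share the reader’s code (as noted above), the conversion predicates should accept the same digits; a quick sanity check:

?- number_codes(N, "४२").
N = 42.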

Not yet handled are floating point numbers. Here it gets a little hairy. After all, Prolog deals with Prolog syntax, so what about signs, the float “.”, grouping, and the exponent?

Non-decimal numbers (6'23, 0o57, 0x5b, 0b1001) are only supported using [0-9a-zA-Z].
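For reference, these radix forms are ordinary SWI-Prolog syntax; a quick check of their values:

?- X = 6'23, Y = 0o57, Z = 0x5b, W = 0b1001.
X = 15,
Y = 47,
Z = 91,
W = 9.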

Another area that may need attention is output. Currently, all numbers are written using Arabic numerals (0-9). We could introduce a flag zero and a corresponding write_term/2 option to set the zero character code. Then you can do this to reproduce the input from above:

?- write_term(42, [zero(0'०)]).
४२

Further comments and PRs welcome!


For floats that is probably the simplest way around. It would also not be too bad. SWI-Prolog has a logarithmic buffering system that (optionally) starts with a local buffer and is used in virtually all places where temporary buffering without a clear upper bound is required. That helps a lot in reducing memory-management costs (today’s memory managers are a lot better than when I started SWI-Prolog, but avoiding them altogether is better still).

Might implement that for floats.

P.s. Your implementation allows for e.g. A = 4२. if I read it correctly. I intentionally consider that a syntax error, i.e. all code points must come from the same 0…9 block. Note that there are a couple of adjacent 0-9 blocks, and some scripts have multiple. So, I compute the 0 for the first digit (code - weight(code)) and demand that all subsequent code points lie between this base and base+9.
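A minimal Prolog sketch of that rule (my own names, not the actual C implementation): derive the block base from the first digit and its weight, then require every following code point to stay within the same ten-code block.

% valid_digit_run(+Chars): all digits come from one 0..9 block.
valid_digit_run([C0|Cs]) :-
    char_type(C0, decimal(W0)),
    char_code(C0, Code0),
    Base is Code0 - W0,     % code point of this block's zero
    forall(( member(C, Cs), char_code(C, Code) ),
           ( Code >= Base, Code =< Base + 9 )).

?- valid_digit_run(['४','२']).   % both Devanagari: accepted
true.

?- valid_digit_run(['4','२']).   % mixed blocks: rejected
false.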

Not in the original Quintus version. On Richard O’Keefe’s suggestion, SWI-Prolog now has ~:d, which prints according to the current locale. Locale handling was added to SWI-Prolog quite a while ago: there is a default locale, and it is possible to associate a specific locale with a stream, affecting writes to that stream. The Prolog representation of the locale is bootstrapped from the system locale and can afterwards be copied and modified. We could add the decimal code-point start to that. I guess we can bootstrap it using snprintf(), printing the integer 0 and seeing what it prints. No clue what happens when using a relevant locale.
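A hedged example (output shown for an en_US-style locale; the grouping and separator come from the locale):

?- format("~:d~n", [12345678]).
12,345,678
true.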

I presume that there’s no need to handle Chinese/Japanese numbers? … they have multiple representations. For example, these are all the same value (and there are probably a few more permutations I haven’t thought of):

20306
２０３０６
二万三百六
弐萬参佰陸
二零三零六
二〇三〇六

:wink:

I just typed the “numbers” as queries in ediprolog mode, with a UTF-8 encoded buffer, without quite knowing what I was doing. I never expected these representations, other than the standard one, to be treated as numbers in arithmetic expressions; it seems enough to me that they are strings or atoms. This is my personal attitude. Japanese uses three or four scripts in daily usage, and I have no proper knowledge about encodings; I am always escaping from character-encoding matters.

% ?- number(20306).
%@ true.

% ?- number(２０３０６).
%@ ERROR: Syntax error: Operator expected
%@ ERROR: number(
%@ ERROR: ** here **
%@ ERROR: ２０３０６) . 

% ?- number(二万三百六).
%@ false.
% ?- number(弐萬参佰陸).
%@ false.

% ?- number(二零三零六).
%@ false.

% ?- number(二〇三〇六).
%@ false.

冗談半分 (I was half-joking)

However, the handling of double-width digits seems to be a bit strange. Sequences of kanji are treated as atoms:

?- atom(二万三千).
true.
?- atom(冗談戯談串戯串戲).
true.

A sequence of double-width digits is an error:

?- atom(２０３０６).
ERROR: Syntax error: Operator expected
ERROR: atom(
ERROR: ** here **
ERROR: ２０３０６) . 
?- atom('２０３０６').
true.

But double-width digits are OK when preceded by a kanji:

?- atom(二万三千２０３０).
true.

Before people go too far on this, please realize that 二万三百六 is “two-ten-thousand-three-hundred-six”, which when written with digits is 20306. There are also double-width digits (the width of a kanji) which (as I mentioned elsewhere) seem to be handled differently: ２０３０６. If you’re doing CJK processing, you distinguish between “half-width” (Latin letters, digits, punctuation) and “full-width” (kanji, digits, letters, punctuation). It’s quite messy and – as @kuniaki.mukai says – I wouldn’t expect full-width digits to be treated as regular digits. And, yes, there are full-width letters:

?- atom(あｂｃで).
true.

This is off topic, perhaps. I said I always escape from encoding things, but I remember one thing related to character order. I wrote Prolog code to generate a minimal-state automaton for a regular expression. The generated state-transition table is order-sensitive for input letters. In fact, the input letters are all non-negative integers, i.e. virtually infinite, and the letter entries of the table are disjoint intervals of integers. Out of curiosity, I tested the generator on the six regexes below to see the letter intervals in the output. As expected, CJK numeral letters do not form an interval of integers.

"(a|b|c|d|e)"
"(a|b|c|e)"

"(1|2|3|4|5)"
"(1|2|3|5)"

"(一|二|三|四|五)"
"(一|二|三|五)"

You can test them, if interested, on my CGI page:

http://web.sfc.keio.ac.jp/~mukai/paccgi7/index.html
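A tiny Prolog sketch of the interval idea (the names are mine, not from the generator): each transition is labelled with a closed integer interval of code points, so (a|b|c|d|e) fits one interval 97..101, while 一 (0x4E00), 二 (0x4E8C), 三 (0x4E09), 四 (0x56DB) and 五 (0x4E94) cannot share one.

% edge(State, Lo-Hi, Next): transition on any code point in Lo..Hi.
edge(q0, 0'a-0'e, q1).

step(State, Code, Next) :-
    edge(State, Lo-Hi, Next),
    between(Lo, Hi, Code).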

So the encoding question remains open for me, together with the character-interval-aware automaton, though I have no plan to investigate further, having little passion for pursuing this messy subject.

Thanks. I will check the pointer. For me, the concept of a letter was not clear and still is not, and neither is the code of a letter. It is easier to think that the code comes first, and that the letter, or its figure, is then assigned to the code. Also, I have never heard a rigorous (topological) definition of a letter precise enough to judge that a font is for that letter and not for another. I may still be in a beginner’s confusion about encodings.

From the docs,

Atoms and Variables
We handle them in one item as they are closely related. The Unicode
standard defines a syntax for identifiers in computer languages. In
this syntax identifiers start with ID_Start followed by a sequence
of ID_Continue codes. Such sequences are handled as a single token
in SWI-Prolog. The token is a variable iff it starts with an
uppercase character or an underscore (_). Otherwise it is an atom.
Note that many languages do not have the notion of character case.
In such languages variables must be written as _name.

AFAIK, the Unicode N* category (numerical symbols) is part of ID_Continue. That explains the old behavior. The new version handles Nd (decimal numerical symbols) as digits. I don’t know how useful this is :slight_smile:
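A hedged illustration, analogous to the full-width example above: since Nd is in ID_Continue but not in ID_Start, a digit from another script can continue an identifier even though it cannot start one:

?- atom(abc१).
true.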

While I have escaped from the encoding problem, I now feel ashamed to see that European Prologers have much more precise knowledge about CJK encodings than I do, who ought to be the native expert. I will have to pick out a book on character encoding by an expert (Keisuke Yano) from my bookshelf. I hope I will catch up with what is discussed here. Thanks.

FYI, SWISH has been updated, so all the stuff discussed here can be tested and demonstrated there. As far as I’m concerned, this is the end of it for now. Whether and where support for non-ASCII numerals should be included must be discussed in a broader context. Adding support for non-decimal systems is far too much work for something that first needs some proof of its value.

Note that editor syntax highlighting in SWISH does not work properly as the SWISH Prolog tokenizer in JavaScript has not been updated.


I have skimmed the book (ISBN 978-4-7741-4164-0), which is in Japanese, and found a boxed page (p. 145) on the Unicode character database, explaining a case in which a table is prepared as an interface to the encoded database content; I also came across some lines about Unihan.

Also, I remember that some Japanese TeX/LaTeX groups have been so active that platex, uplatex, xelatex and so on are now included in the TeX Live/MacTeX distributions. One contributor among many others is a researcher on classical/archival documents written in old Japanese characters, which are not fully covered even by UTF-8.

Also, we use a smart Unix command-line tool, NKF, to convert the character encoding of files, to which, I heard, a Japanese developer made the central contribution. Thanks to them, I could live an almost easy life w.r.t. character encoding by using UTF-8 for everything. Once upon a time we used to have a hard time juggling EUC and JIS encodings, but now we have almost forgotten such troubles.

The standard text in English for Chinese/Japanese/Korean/Vietnamese processing is CJKV Information Processing, 2nd Edition.

It’s a bit obsolete (and I’ve lost my copy) now that Unicode has supplanted EUC, JIS, Shift-JIS, Big5, etc. (good-bye 文字化け moji-bake, “garbled characters”), but it was essential reading for anyone who had to deal with CJK text ~25 years ago. (There still remain plenty of gotchas, though, such as 全角・半角, full-width vs. half-width.)

Anyway, I wouldn’t expect kanji numbers to be treated the same as ASCII digits, nor would I expect 令和4年6月24日 to be automatically translated to 2022-06-24. And I doubt that anyone would care if wide digits (２０２２−０６−２４) need to be quoted.

According to Wikipedia, Shift-JIS accounts for “6.0% of sites in the .jp domain. UTF-8 is used by 93.8% of Japanese websites.” So, it’s just a ghost of its former self.

I agree. But office workers might think differently, because they may now be busy moving toward paperless offices.

BTW, recently researchers in India expressed thanks to Japanese temples, which still use the original Sanskrit script (梵字, Siddham?) for some purposes; the script was lost in India. One of the good, or bad, Japanese traditions is to keep imported things as they are, revising them only a little while maintaining respect. I heard that many of the Chinese kanji words for modern concepts have been coined by the Japanese (60%?) since Meiji (1870).

I hope this perhaps-good tradition will not become an obstacle that prevents the shift to a paperless society. This is only my opinion; of course, I have never discussed it with others, who may think differently.

I prefer UTF-8 to UTF-16, even without such statistics. Perhaps I do so because Apple supports UTF-8.

BTW, the kanji 熊 (bear) is a trauma for me. About 30 years ago, I spent a long time debugging Prolog code that simply read a file, and finally found that the kanji’s encoding contained an ASCII control code that stopped the program. It was EUC or Shift-JIS; I don’t remember which. I had never imagined that such errors could be caused by text encoding. Then I heard that UTF-8 has no such problem, which is a simple but major reason that I prefer UTF-8.
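That property is easy to check: in UTF-8, every byte of a multi-byte sequence has the high bit set, so no byte of 熊 can collide with an ASCII code, control or otherwise. A quick look using library(utf8) (all three bytes are >= 128):

?- use_module(library(utf8)).
true.

?- char_code(熊, C), phrase(utf8_codes([C]), Bytes).
C = 29066,
Bytes = [231, 134, 138].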

Thanks for the useful comments. By saying that Apple supports UTF-8, I meant that Apple’s default encoding is UTF-8. By “default encoding” I mean that Apple products, e.g. TextEdit.app, accept a UTF-8 file but, I suspect, do not accept a UTF-16 file without showing a dialog for changing the encoding; I have not tested this, but it is based on my experience with macOS. Unfortunately, I have no occasion to touch Windows. Also, by default, text files are in UTF-8 on iOS, iPadOS, and macOS. I guess it is Apple’s policy for its recent universal-access functionality.