# Unicode symbols, and a possible numerical paradox!

A recent issue was opened on the SWI-Prolog GitHub discussions that should really be taking place here.

Unicode symbols, and a possible numerical paradox!

I came across a similar problem many years ago, so the reply by Jan W. should be useful to anyone in a similar situation. The situation may not look the same when first encountered, but it comes up often enough to warrant a discussion.

This would require that your Prolog system understands numeric values
of Unicode code points, and then uses these numeric values when parsing
number tokens. It seems SWI-Prolog on Windows cannot do it, I get:

/* SWI-Prolog 8.5.12 */
?- X = [०,१,२,३,४,५,६,७,८,९].
X = ['०', '१', '२', '३', '४', '५', '६', '७', '८'|...].
?- char_code('१', X).
X = 2407.
?- code_type(2407, digit(X)).
false.


So all the Sanskrit numerals are read as atoms. A Prolog system
that understands Sanskrit numerals can parse them as follows:

?- X = [०,१,२,३,४,५,६,७,८,९].
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
?- code_numeric(2407, X).
X = 1.


In Jekejeke Prolog I get this for free from Java Integer.parseInt() and
in Dogelog I had to do it a more elaborate way. The common basis for both
systems is UAX 44 numeric value.
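
For readers without either system at hand, the same UAX 44 lookup can be sketched with Python's standard unicodedata module. The name code_numeric below only mirrors the predicate used above; it is an illustrative wrapper and not the Jekejeke or Dogelog implementation.

```python
import unicodedata

def code_numeric(code):
    """Return the decimal digit value (0..9) of a code point, or -1 if it has none."""
    return unicodedata.decimal(chr(code), -1)

print(code_numeric(2407))      # DEVANAGARI DIGIT ONE
print(code_numeric(ord("a")))  # a letter has no decimal digit value
print(int("१२३"))              # Python's int() accepts Unicode decimal digits
```

The wrapper returns -1 rather than raising an error, which is an assumption chosen to match the `val >= 0` test in the JavaScript routine further down the thread.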

In Jekejeke Prolog I also had to switch the font to Nirmala UI to see the glyphs;
in the browser for Dogelog, on the other hand, I don’t need such a font switch,
it seems to work with the default font.

As the docs for char_type/2 tell you, the result depends on the output of the Unicode classification functions from <wctype.h>. So, the result may vary depending on the OS locale implementation and the selected locale. This is unlike Prolog’s read/1 implementation, which uses a table that is compiled from the official Unicode data (this needs to be redone, as it now uses Unicode 6.0). These types are available from char_type/2 as prolog_* types.

That leaves two questions. Is it a good idea for read/1 to parse digits from other scripts to numbers? At the moment number_codes/2 and friends use the same code as read/1, which I think is demanded by the ISO standard. Second, is there a C(++) library for this? We probably have to do more work to deal with unbounded integers …

Is there an SWI-Prolog API to access this Unicode data?
What properties are stored? There are not so many numeric
values to store, for example I get:

?- findall(C, (between(0,0x10FFFF,C), code_numeric(C, X), 0=<X, X=<9), L),
length(L, N), write(N), nl, fail.
650


There are more such values for radix > 10. The above are only the
decimal digits. The main use cases are Arabic scripts and Indian scripts,
I guess, and a few others, plus glyph style variants that went into Unicode.
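
A rough cross-check of that count, sketched with Python's unicodedata module instead of code_numeric/2. The result depends on the Unicode version bundled with the interpreter, so it need not match the 650 reported above; the variable names are illustrative.

```python
import sys
import unicodedata

# Collect every code point whose general category is Nd (decimal digit).
decimals = [c for c in range(sys.maxunicode + 1)
            if unicodedata.category(chr(c)) == "Nd"]

# Group the code points by their digit value 0..9.
by_value = {}
for c in decimals:
    by_value.setdefault(unicodedata.decimal(chr(c)), []).append(c)

print(len(decimals), "decimal digits in Unicode", unicodedata.unidata_version)
print(sorted(by_value))
```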

Edit 22.06.2022:
Browser doesn’t even show all glyphs:

0123456789
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
߀߁߂߃߄߅߆߇߈߉
०१२३४५६७८९
০১২৩৪৫৬৭৮৯
੦੧੨੩੪੫੬੭੮੯
૦૧૨૩૪૫૬૭૮૯
୦୧୨୩୪୫୬୭୮୯
௦௧௨௩௪௫௬௭௮௯
౦౧౨౩౪౫౬౭౮౯
೦೧೨೩೪೫೬೭೮೯
൦൧൨൩൪൫൬൭൮൯
෦෧෨෩෪෫෬෭෮෯
๐๑๒๓๔๕๖๗๘๙
໐໑໒໓໔໕໖໗໘໙
༠༡༢༣༤༥༦༧༨༩
၀၁၂၃၄၅၆၇၈၉
႐႑႒႓႔႕႖႗႘႙
០១២៣៤៥៦៧៨៩
᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙
᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏
᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙
᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉
᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙
᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙
᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹
᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉
᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙
꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩
꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙
꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉
꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙
꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹
꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙
꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹
０１２３４５６７８９
𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩
𐴰𐴱𐴲𐴳𐴴𐴵𐴶𐴷𐴸𐴹
𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯
𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹
𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿
𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙
𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹
𑑐𑑑𑑒𑑓𑑔𑑕𑑖𑑗𑑘𑑙
𑓐𑓑𑓒𑓓𑓔𑓕𑓖𑓗𑓘𑓙
𑙐𑙑𑙒𑙓𑙔𑙕𑙖𑙗𑙘𑙙
𑛀𑛁𑛂𑛃𑛄𑛅𑛆𑛇𑛈𑛉
𑜰𑜱𑜲𑜳𑜴𑜵𑜶𑜷𑜸𑜹
𑣠𑣡𑣢𑣣𑣤𑣥𑣦𑣧𑣨𑣩
𑥐𑥑𑥒𑥓𑥔𑥕𑥖𑥗𑥘𑥙
𑱐𑱑𑱒𑱓𑱔𑱕𑱖𑱗𑱘𑱙
𑵐𑵑𑵒𑵓𑵔𑵕𑵖𑵗𑵘𑵙
𑶠𑶡𑶢𑶣𑶤𑶥𑶦𑶧𑶨𑶩
𖩠𖩡𖩢𖩣𖩤𖩥𖩦𖩧𖩨𖩩
𖭐𖭑𖭒𖭓𖭔𖭕𖭖𖭗𖭘𖭙
𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗
𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫
𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵
𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿
𞅀𞅁𞅂𞅃𞅄𞅅𞅆𞅇𞅈𞅉
𞋰𞋱𞋲𞋳𞋴𞋵𞋶𞋷𞋸𞋹
𞥐𞥑𞥒𞥓𞥔𞥕𞥖𞥗𞥘𞥙
🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹

\documentclass[12pt]{article}

\usepackage{xltxtra}
\usepackage[none]{hyphenat}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{amssymb}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguages{sanskrit}
\newfontfamily\devanagarifont[Script=Devanagari]{Noto Serif Devanagari}

\newcommand{\zdn}[1]{\begin{sanskrit}#1\end{sanskrit}}
\newcommand{\zcode}[1]{\textasciigrave{}#1\textasciigrave{}}
\newcommand{\crossschwa}[1]{\lower.05cm\hbox{\rotatebox{45}{\begin{tikzpicture}[#1]
\draw (-1,1) -- (-1,0) -- (1,0) -- (1,-1);
\draw (-1,-1) -- (0,-1) -- (0,1) -- (1,1);
\end{tikzpicture}}}} % https://tex.stackexchange.com/q/263269

\title{issue(115).}
\author{\crossschwa{scale=0.16} \zdn{\char"0936\char"094D\char"092F\char"093E\char"092E} \crossschwa{scale=0.16}}
\date{HE 12022.06.23}

\begin{document}
\maketitle
\abstract{\zdn{\char"0950} In this \XeLaTeX{} document, I present my work in brainstorming how Prolog handles Indic scripts. Noto Serif Devanagari is the best font in the Raspberry Pi repos. All content used for academic purposes.}
\tableofcontents

\section{Unicode support in web browsers}
\zdn{𑥐𑥑𑥒𑥓𑥔𑥕𑥖𑥗𑥘𑥙} is the only one that Raspberry Pi doesn't support SOOTB-ly; tested both Firefox, and Chromium.
Maybe they should also be tested in \XeLaTeX{}, but I'm not sure about Polyglossia, nor its Sanskrit environment, nor Devanagari fonts; I can't imagine the \href{https://ctan.org/pkg/sanskrit}{Sanskrit}, nor \href{https://ctan.org/pkg/devanagari}{Devanagari}, packages supporting Unicode either (I think they support Velthuis, or ``some-thing'' similar).

\section{Choose one, and be consistent}
Standalone Indic numerals should match Indic numbers; \zcode{[\zdn{\char"0967},\zdn{\char"0968},\zdn{\char"0969}]} is legal, whilst \zcode{\zdn{\char"0967\char"0968\char"0969}} isn't; I can't imagine this discrepancy makes any sense.

\section{RFC 4: Hindu Numbers}
A more informative message should indicate why an error has been thrown, until this has been implemented; \texttt{\textbackslash{}texttt\{}\zcode{123 \textbackslash{}= \texttt{\}}\zdn{\char"0967\char"0968\char"0969}\texttt{\textbackslash{}texttt\{}, 123 == \texttt{\}}\zdn{\char"0967\char"0968\char"0969}\texttt{\textbackslash{}texttt\{}.}\texttt{\}} (Sometimes you just want to break all the rules!).
I'm assuming \texttt{=/2} is unification, and \texttt{==/2} is equality, but I can't remember; \texttt{=:=/2}, occurs check, and whatever else is out there… (I couldn't discover \texttt{\zcode{A = B, A \textbackslash{}== B; A \textbackslash{}= B, A == B.}}; its type must be uninhabited for Turing-complete Prolog, where Shyam-completeness can implement ``any-thing'', and ``every-thing'' $\therefore 0$ type inhabitance is a cake baked by \zdn{\char"092E\char"093E\char"092F\char"093E{}}, where \texttt{\zcode{Roman = māyā.}}.)

\section{Real \texttt{/deva?s/} use \textbf{Dev}anagari!!! :D}
Sanskrit is the language of the multiverse, where life is a game of snakes, and ladders; my avatar is in a winning position $\because$ I was a \zdn{\char"0926\char"0947\char"0935{}} (\texttt{\zcode{Roman = deva}}) in my life before this. \#checkmate

Iff you learn Sanskrit, you'll feel enlightened by ``wingardium leviosa''; swish, and flick, your tongue. \#magically\_and\_notnot\_technologically\_rythmic\_advancements

\zdn{\char"092A\char"0941\char"0928\char"0930\char"094D{} \char"092E\char"093F\char"0932\char"093E\char"092E\char"0903 \char"0965{} \char"0967{} \char"0965{}} \#reincarnation

\end{document}


I’ve updated this a little, but we keep fairly limited information. From the generated src/pl-umap.c, we get this list. U_DECIMAL is just added and deals with the Unicode category Nd. All details on how this is generated are in src/Unicode/prolog_syntax_map.pl, which generates src/pl-umap.c (and can also generate JavaScript to support SWISH (not tested whether the updates still produce valid JavaScript)).

#define U_ID_START           0x1
#define U_ID_CONTINUE        0x2
#define U_UPPERCASE          0x4
#define U_SEPARATOR          0x8
#define U_SYMBOL            0x10
#define U_OTHER             0x20
#define U_CONTROL           0x40
#define U_DECIMAL           0x80


These are accessible mostly through the char_type/2 types prolog_* and (new) decimal. See also SWI-Prolog -- Unicode Prolog source

This is now used in char_type/2, with a new type decimal (didn’t want to touch digit), so we can do this to enumerate all decimals with weight 6:

?- char_type(C, decimal(6)).
C = '6' ;
C = '٦' ;
C = '۶' ;
C = '߆' ;
C = '६' ;
C = '৬' ;
C = '੬' ;
C = '૬' ;
C = '୬' ;
C = '௬'
...


To get a nice list as you have:

?- code_type(S, decimal(0)), E is S+9, numlist(S,E,L), format('~s~n', [L]).
0123456789
S = 48,
E = 57,
L = [48, 49, 50, 51, 52, 53, 54, 55, 56|...] ;
٠١٢٣٤٥٦٧٨٩
S = 1632,
E = 1641,
L = [1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639, 1640|...] ;
۰۱۲۳۴۵۶۷۸۹
S = 1776,
E = 1785,
L = [1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784|...] ;
߀߁߂߃߄߅߆߇߈߉
S = 1984,
E = 1993,
L = [1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992|...] ;
०१२३४५६७८९
S = 2406,
E = 2415,
L = [2406, 2407, 2408, 2409, 2410, 2411, 2412, 2413, 2414|...]


The parser is also extended to read integers and rational numbers from other scripts, so

?- A = ४२.
A = 42.


Not yet handled are floating point numbers. Here it gets a little hairy. After all, Prolog deals with Prolog syntax, so what about signs, float “.”, grouping, exponent?

Non-decimal numbers (6'23, 0o57, 0x5b, 0b1001) are only supported using [0-9a-zA-Z].

Another area that may need attention is output. Now all numbers are written as Arabic numerals (0-9). We could introduce a flag zero and a corresponding write_term/2 option to set the zero character code. Then you could do this to get the input from above:

?- write_term(42, [zero(0'०)]).
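
To make the proposal concrete, here is a minimal Python sketch of what such a zero option could do on output. The function write_with_zero is hypothetical and only illustrates the digit mapping; it is not part of SWI-Prolog.

```python
def write_with_zero(number, zero="0"):
    """Render an integer with digits taken from the 0..9 block starting at `zero`."""
    base = ord(zero)
    # Map ASCII '0'..'9' to base..base+9; a sign passes through unchanged.
    table = {ord("0") + d: base + d for d in range(10)}
    return str(number).translate(table)

print(write_with_zero(42, "०"))     # Devanagari digits
print(write_with_zero(-2022, "٠"))  # Arabic-Indic digits, sign untouched
```

Only the zero code point needs to be configured, because the other nine digits of a decimal block follow it directly.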



Cool. I solved floats and bignums with a routine ascii_replace() in Dogelog.
It is just called before number conversion is done, so that number conversion
only has to deal with ASCII:
function ascii_replace(text, radix) {
    let buf = "";
    let last = 0;   // start of the part of text not yet copied to buf
    let pos = 0;
    while (pos < text.length) {
        let ch = text.codePointAt(pos);
        let val = code_numeric(ch);
        if (val >= 0 && val < radix) {
            if (ch <= 127) { // ASCII
                /* already ASCII, nothing to replace */
            } else {
                if (val < 10) {
                    val += 48; // '0'
                } else {
                    val += 55; // 'A'-10
                }
                buf += text.slice(last, pos) + String.fromCodePoint(val);
                last = pos + char_count(ch);
            }
        } else {
            /* not a digit in this radix, keep as is */
        }
        pos += char_count(ch);
    }
    if (last !== 0) {
        // at least one replacement happened, append the remaining tail
        buf += text.slice(last);
        return buf;
    } else {
        return text;
    }
}


It's a little tricky. It's surrogate-pair aware, that's why it uses char_count to advance
the position. This might not be an issue for SWI-Prolog. And it does replacement only
if needed. If you supply it with a number that is already ASCII, it gives back the original string.

I guess it can be rather inefficient for things that are non-ASCII, since I use string
manipulation. If you have logarithmic buffers, that grow by doubling their size, you
can do it faster and replace the buf += by these buffers. But currently I didn’t bother
making it fast for non-ASCII numbers.

Edit 23.06.2022:
Using alternative periods, grouping separators, etc.: I didn’t pursue that. The period is
fixed to be “.”, it cannot be for example “,”. And so on, just fixed as in the ISO core standard.
The above routine returns punctuation such as “.” and “,” unchanged; a sign “-”
or “+” is also returned unchanged. It only replaces characters that have a numeric value
in the given radix range. You need to have a fast code_numeric() to make it fly.
If code_numeric() is slow, you also get a penalty for ASCII-only numbers:
no new string will be created, but the scanning becomes costly.

For floats that is probably the simplest way around. It would also not be too bad. SWI-Prolog has a logarithmic buffering system that (optionally) starts with a local buffer to deal with virtually all places where temporary buffering without a clear upper bound is required. That helps a lot in reducing memory management costs (today’s memory managers are a lot better than when I started SWI-Prolog, but avoiding them is even better).

Might implement that for floats.

P.s. Your implementation allows for e.g. A = 4२. if I read it correctly. I intentionally consider that a syntax error, i.e. all code points must come from the same 0…9 block. Note that there are a couple of adjacent 0-9 blocks and some code pages have multiple. So, I compute the zero from the first code point (code - weight(code)) and demand that all subsequent code points lie between this base and base+9.
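
The same-block rule can be sketched in a few lines of Python. This is only an illustration of the rule as described in this post, using the standard unicodedata module for the digit weight; it is not the SWI-Prolog C code, and parse_decimal is a made-up name.

```python
import unicodedata

def parse_decimal(text):
    """Parse a non-empty digit string whose digits all come from one 0..9 block."""
    if not text:
        raise ValueError("empty number")
    first = unicodedata.decimal(text[0], -1)
    if first < 0:
        raise ValueError("not a decimal digit: %r" % text[0])
    base = ord(text[0]) - first          # code point of this block's zero
    value = 0
    for ch in text:
        weight = ord(ch) - base
        if not 0 <= weight <= 9:         # digit from another block, or no digit
            raise ValueError("unexpected character: %r" % ch)
        value = value * 10 + weight
    return value

print(parse_decimal("४२"))
print(parse_decimal("２０３０６"))
```

With this rule, a mixed input such as "4२" raises ValueError, because २ does not lie between the ASCII zero and zero+9.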

From inspecting src/pl-umap.c, it looks like you use a two-level access,
which is also fast! Not sure whether the second level, the 256 blocks,
is merged, i.e. identical blocks put together? If you put them together,
I get this compression. Only 150 different 256 blocks are needed?

pool.size()*PARA_SIZE=   38400
pool2.size()*BLOCK_SIZE= 4096
MAX_BLOCK=               1
Total=                   42497
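
For readers unfamiliar with the technique, here is a minimal Python sketch of a two-level table with merged identical blocks. It stores a single one-bit property (category Nd) taken from Python's unicodedata, so its block counts will differ from the figures quoted in this post; the names build_two_level, lookup and is_decimal are illustrative only.

```python
import sys
import unicodedata

BLOCK_SIZE = 256  # code points per second-level block

def build_two_level(prop):
    """Return (index, pool): index maps a block number to a slot in pool."""
    pool, seen, index = [], {}, []
    for start in range(0, sys.maxunicode + 1, BLOCK_SIZE):
        block = bytes(prop(c) for c in range(start, start + BLOCK_SIZE))
        if block not in seen:            # merge identical blocks
            seen[block] = len(pool)
            pool.append(block)
        index.append(seen[block])
    return index, pool

def lookup(index, pool, code):
    """Two accesses: first the block index, then the entry inside the block."""
    return pool[index[code >> 8]][code & 0xFF]

def is_decimal(code):
    # Example property: 1 for a decimal digit (category Nd), else 0.
    return 1 if unicodedata.category(chr(code)) == "Nd" else 0

index, pool = build_two_level(is_decimal)
print(len(index), "index entries,", len(pool), "distinct blocks")
print(lookup(index, pool, 2407))
```

The total size is then the index plus the number of distinct blocks times BLOCK_SIZE, which corresponds to the kind of sum shown in the statistics above.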


I am currently using a 3-level compression in Dogelog. You see that
it is 3-level, since MAX_BLOCK is no longer 1. It has different
parameters and the umap gets smaller:

pool.size()*PARA_SIZE=   10496
pool2.size()*BLOCK_SIZE= 3904
MAX_BLOCK=               1088
Total=                   15488


That is about 36% of the size (the ratio 15488 / 42497), i.e. roughly 64% smaller.
This matters since Dogelog goes over the wire to the browser. I saw some Google
code some months ago, where they use a more drastic compression, but I have not
yet dug up what they did…

Edit 23.06.2022:
If only I could find the Google code again. Maybe they pack the maps
and then unpack them as an initialization step of the system? This would also
reduce the size that goes over the wire. Might not be an issue
for SWI-Prolog. But what is a little frightening is that the Unicode database is
ever growing. From JDK 1.8 I had ca. 12’000 as the compressed size,
and for JDK 16 it leaped to ca. 15’000. So researching compression
is an investment into the future: this thing might get bigger and bigger,
or there might be demand for new properties, which then might spoil
a compression and give a new challenge.

Interesting, a new feature in the form of a write option:

I have the feeling this belongs to Prolog format/2 family of predicates.
Which works differently in Java, but from Java I know the corresponding
format/2 family can have localized punctuation, currency symbols etc…

Not sure whether the Prolog format/2 family ever went into this direction?
But then I see that Java has also some further utility, a numeric shaper!
Which according to documentation looks at the context:

// shape European digits to ARABIC digits if preceding text is Arabic, or
// shape European digits to TAMIL digits if preceding text is Tamil, or
// leave European digits alone if there is no preceding text, or
// preceding text is neither Arabic nor Tamil


https://docs.oracle.com/javase/7/docs/api/java/awt/font/NumericShaper.html

So I got this idea right now, translated to Prolog: If you have an answer substitution
with a Sanskrit variable name, it should show a number answer in Sanskrit script?
Maybe this can be generalized: the last alphanum token (variable or atom; it
would thus ignore symbols) during output determines the number shaping?

Not in the original Quintus version. On Richard O’Keefe’s suggestion SWI-Prolog now has ~:d which prints according to the current locale. Locale handling was added to SWI-Prolog quite a while ago. There is a default locale and it is possible to associate a specific locale to a stream, affecting write to this stream. The Prolog representation of the locale is bootstrapped from the system locale and can afterwards be copied and modified. We could add the decimal code point start to that. I guess we can bootstrap it using snprintf(), printing the integer 0 and see what it prints. No clue what happens when using a relevant locale.

I presume that there’s no need to handle Chinese/Japanese numbers? … they have multiple representations. For example, these are all the same value (and there are probably a few more permutations I haven’t thought of):

20306
２０３０６



I just typed these “numbers” as queries in ediprolog mode with a UTF-8 encoded buffer, without really knowing what I was doing. I never expected any of these representations except the standard one to be treated as numbers in arithmetic expressions; for me it is enough that they are strings or atoms. This is my personal attitude. Japanese has three or four sets of characters in daily use, and I do not have adequate knowledge of encodings; I always stay away from character encoding matters.

% ?- number(20306).
%@ true.

% ?- number(２０３０６).
%@ ERROR: Syntax error: Operator expected
%@ ERROR: number(
%@ ERROR: ** here **
%@ ERROR: ２０３０６) .

% ?- number(二万三百六).
%@ false.
% ?- number(弐萬参佰陸).
%@ false.

% ?- number(二零三零六).
%@ false.

% ?- number(二〇三〇六).
%@ false.


However, the handling of double-width digits seems to be a bit strange. Sequences of kanji are treated as atoms:

?- atom(二万三千).
true.
?- atom(冗談戯談串戯串戲).
true.


A sequence of double-width digits is an error:

?- atom(２０３０６).
ERROR: Syntax error: Operator expected
ERROR: atom(
ERROR: ** here **
ERROR: ２０３０６) .
?- atom('２０３０６').
true.


But double-width digits are OK when preceded by a kanji:

?- atom(二万三千２０３０).
true.


Before people go too far on this, please realize that 二万三百六 is “two-ten-thousand-three-hundred-six”, which when written with digits is 20306. There are also double-width digits (the width of a kanji) which (as I mentioned elsewhere) seem to be handled differently: ２０３０６. If you’re doing CJK processing, you distinguish between “half-width” (Latin letters, digits, punctuation) and “full-width” (kanji, digits, letters, punctuation). It’s quite messy and – as @kuniaki.mukai says – I wouldn’t expect full-width digits to be treated as regular digits. And, yes, there are full-width letters:

?- atom(あｂｃで).
true.


Works fine on my side, Dogelog and Jekejeke, and will probably also
work fine in SWI-Prolog as soon as the next daily build is out,
if I understand this post here by Jan W. correctly, we will have:

?- A = ２０３０６.
A = 20306.


This is because it's already in Unicode® Standard Annex #44
[ UNICODE CHARACTER DATABASE ]?

FULLWIDTH DIGIT TWO
European Number
Unicode Character 'FULLWIDTH DIGIT TWO' (U+FF12)

For Unihan you would also need to tap into Unicode® Standard
Annex #38 [ UNICODE HAN DATABASE (UNIHAN) ]?

This is perhaps off topic. I said I always stay away from encoding matters, but I remember one thing related to character order. I wrote Prolog code to generate a minimal-state automaton for a regular expression. The generated state transition table is order-sensitive with respect to the input letters. In fact, the input letters are all non-negative integers, i.e. virtually infinite, and the letter entries of the table are disjoint intervals of integers. Out of curiosity, I tested the generator on the six regexes below to see the letter intervals in the output. As expected, CJK numeral letters do not form an interval of integers.

"(a|b|c|d|e)"
"(a|b|c|e)"

"(1|2|3|4|5)"
"(1|2|3|5)"

"(一|二|三|四|五)"
"(一|二|三|五)"

You can test if interested in my cgi page

http://web.sfc.keio.ac.jp/~mukai/paccgi7/index.html

So the encoding question remains open for me, even with the character-interval-aware automaton, though I have no plan to investigate it because I have little passion for pursuing this messy thing.
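
The effect on letter intervals can be illustrated without the automaton generator itself. The following Python sketch only merges the code points of each alphabet into disjoint intervals; code_intervals is an illustrative helper, not the code behind the cgi page.

```python
def code_intervals(letters):
    """Merge the code points of `letters` into sorted, disjoint closed intervals."""
    intervals = []
    for code in sorted(set(map(ord, letters))):
        if intervals and code == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], code)   # extend the current run
        else:
            intervals.append((code, code))             # start a new run
    return intervals

for alphabet in ["abcde", "abce", "12345", "1235", "一二三四五", "一二三五"]:
    print(alphabet, code_intervals(alphabet))
```

The Latin letters and ASCII digits collapse into one or two intervals, while each kanji numeral ends up in an interval of its own.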

I guess they are anyway out of scope of what Jan W. did.
Since they are not in UAX 44, only in UAX 38. But to be
100% sure, will need to test when the next daily build is out.

That they are not consecutive Unicode code points you
also see without the new change of Jan W.:

?- set_prolog_flag(double_quotes, codes).
true.
?- X = "一二三四五".
X = [19968, 20108, 19977, 22235, 20116].


Edit 23.06.2022:
Yes indeed no Unihan, I find in the documentation of Jan W. umap converter:

Before running one must obtain the following Unicode data files:

• UnicodeData.txt
• DerivedCoreProperties.txt

https://github.com/SWI-Prolog/swipl-devel/tree/master/src/Unicode

If you check UnicodeData.txt, then for example “二” (U+4E8C, CJK Unified
Ideograph-4E8C) is not defined in it. No data for it.
This is because UnicodeData.txt is only UAX 44, and not UAX 38.

Thanks. I will check the pointer. For me, the concept of a letter was not clear, and still isn’t, and the code of a letter is not clear either. It is easier to think that the code comes first, and then the code is assigned to a letter or its figure. Also, I haven’t heard of a rigorous (topological) definition of a letter that is sufficient to judge that a font glyph is for that letter and not for another. I may still be in a beginner’s confusion about encoding matters.

From the docs,

Atoms and Variables
We handle them in one item as they are closely related. The Unicode
standard defines a syntax for identifiers in computer languages. In