Unicode symbols, and a possible numerical paradox!

I’ve updated this a little, but we still keep fairly limited information per code point. The generated src/pl-umap.c defines the flags below. U_DECIMAL has just been added and covers the Unicode category Nd. All details on how this is generated are in src/Unicode/prolog_syntax_map.pl, which generates src/pl-umap.c. It can also generate JavaScript to support SWISH, but I have not tested whether the updates still produce valid JavaScript.

#define U_ID_START           0x1
#define U_ID_CONTINUE        0x2
#define U_UPPERCASE          0x4
#define U_SEPARATOR          0x8
#define U_SYMBOL            0x10
#define U_OTHER             0x20
#define U_CONTROL           0x40
#define U_DECIMAL           0x80

These are accessible mostly through the char_type/2 types prolog_* and (new) decimal. See also SWI-Prolog -- Unicode Prolog source
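To make the prolog_* access concrete, here is a minimal sketch that reports which of the documented prolog_* types of char_type/2 hold for a character. prolog_classes/2 is a hypothetical helper name, not part of SWI-Prolog, and the list of types is just the subset I picked for illustration.

```prolog
%  prolog_classes(+Char, -Classes)
%
%  Hypothetical helper: collect the prolog_* types of char_type/2
%  that hold for Char. Only the four types listed here are checked.
prolog_classes(Char, Classes) :-
    findall(Type,
            ( member(Type, [ prolog_var_start,
                             prolog_atom_start,
                             prolog_identifier_continue,
                             prolog_symbol
                           ]),
              char_type(Char, Type)
            ),
            Classes).
```

For example, ?- prolog_classes(a, Cs). should include prolog_atom_start in Cs, and ?- prolog_classes('A', Cs). should include prolog_var_start.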

This is now used in char_type/2, with a new type decimal(Weight) (I didn’t want to touch digit), so we can do this to enumerate all decimals with weight 6:

?- char_type(C, decimal(6)).
C = '6' ;
C = '٦' ;
C = '۶' ;
C = '߆' ;
C = '६' ;
C = '৬' ;
C = '੬' ;
C = '૬' ;
C = '୬' ;
C = '௬' 
...
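
The same type can be used the other way around, to get the weight of a given character. As a minimal sketch, assuming a version that has the new decimal type: the following computes the value of a digit string in any script. digits_value/2 and add_digit/3 are hypothetical helper names, not part of SWI-Prolog.

```prolog
%  digits_value(+Text, -Value)
%
%  Hypothetical helper: Value is the non-negative integer written in
%  Text, where every character must be a Unicode decimal digit (Nd).
%  Note that this sketch does not check that all digits come from the
%  same script. It fails on empty input and on any non-digit.
digits_value(Text, Value) :-
    string_chars(Text, Chars),
    Chars = [_|_],
    foldl(add_digit, Chars, 0, Value).

%  Multiply the accumulator by 10 and add the weight of Char.
add_digit(Char, V0, V) :-
    char_type(Char, decimal(Weight)),
    V is V0*10 + Weight.
```

With this, ?- digits_value("४२", V). should give V = 42.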

To get a nice list as you have:

?- code_type(S, decimal(0)), E is S+9, numlist(S,E,L), format('~s~n', [L]).
0123456789
S = 48,
E = 57,
L = [48, 49, 50, 51, 52, 53, 54, 55, 56|...] ;
٠١٢٣٤٥٦٧٨٩
S = 1632,
E = 1641,
L = [1632, 1633, 1634, 1635, 1636, 1637, 1638, 1639, 1640|...] ;
۰۱۲۳۴۵۶۷۸۹
S = 1776,
E = 1785,
L = [1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784|...] ;
߀߁߂߃߄߅߆߇߈߉
S = 1984,
E = 1993,
L = [1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992|...] ;
०१२३४५६७८९
S = 2406,
E = 2415,
L = [2406, 2407, 2408, 2409, 2410, 2411, 2412, 2413, 2414|...] 
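
The query above relies on the ten digits of a script being contiguous, starting at its zero. Under that same assumption, the zero of the script a digit belongs to can be found by subtracting the weight. A minimal sketch, where script_zero/2 and same_script/2 are hypothetical helper names:

```prolog
%  script_zero(+Code, -Zero)
%
%  Hypothetical helper: Zero is the code of the digit zero in the
%  script that the decimal digit Code belongs to. Assumes the digits
%  0..9 of a script occupy ten consecutive code points.
script_zero(Code, Zero) :-
    code_type(Code, decimal(Weight)),
    Zero is Code - Weight.

%  same_script(+Code1, +Code2)
%
%  True if both codes are decimal digits with the same zero.
same_script(Code1, Code2) :-
    script_zero(Code1, Zero),
    script_zero(Code2, Zero).
```

For example, ?- script_zero(0'६, Z). should give Z = 2406, the start of the Devanagari range in the output above.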

The parser is also extended to read integers and rational numbers from other scripts, so

?- A = ४२.
A = 42.

Floating point numbers are not yet handled. Here it gets a little hairy. After all, Prolog deals with Prolog syntax, so what about signs, the float “.”, grouping, and the exponent?

Non-decimal numbers (6'23, 0o57, 0x5b, 0b1001) are only supported using [0-9a-zA-Z].

Another area that may need attention is output. Currently, all numbers are written using the digits 0-9. We could introduce a flag zero and a corresponding write_term/2 option to set the character code used for zero. Then you could do this to write 42 in the script used for the input above:

?- write_term(42, [zero(0'०)]).
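
Until something like that exists, the effect can be approximated in user code. A minimal sketch, assuming non-negative integers and that the ten digits of a script are contiguous from its zero; number_in_script/3 and shift_digit/3 are hypothetical names, not existing predicates:

```prolog
%  number_in_script(+Integer, +ZeroCode, -Text)
%
%  Hypothetical helper: Text is a string holding the non-negative
%  Integer written with the digits of the script whose zero has
%  character code ZeroCode.
number_in_script(N, Zero, Text) :-
    integer(N),
    N >= 0,
    number_codes(N, Codes),            % codes in the range 0'0..0'9
    maplist(shift_digit(Zero), Codes, Shifted),
    string_codes(Text, Shifted).

%  Move a digit code from the 0-9 range to the range starting at Zero.
shift_digit(Zero, C0, C) :-
    C is Zero + C0 - 0'0.
```

So ?- number_in_script(42, 0'०, T). should give T = "४२".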

Further comments and PRs welcome!
