Unicode symbols, and a possible numerical paradox!

In Unicode the term “glyph” is used. A glyph can be a digit, a letter, etc.

Examples:
2 : Digit Two
D : Latin Capital Letter D

Unfortunately a “glyph” might not be all that unique; for example the dollar sign $
can be drawn with one or two vertical bars.

In Unicode such a glyph then gets assigned a code point in the range 0x0000 … 0x10FFFF.
This is what Unicode® Standard Annex #44 [ UNICODE CHARACTER DATABASE ] covers,
and the category Nd, mentioned by Jan W., also comes from this database.
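
To make Nd a bit more concrete: every character in that category carries a decimal digit value. As a small illustration (my own queries; how far this goes beyond ASCII depends on the build, as shown further down), SWI-Prolog exposes such classifications via char_type/2:

?- char_type('7', digit(W)).
W = 7.

?- char_type('D', upper(L)).
L = d.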

For Unihan inside Unicode a different entity is represented. The basic entity there
is an “abstract character”, covering not only glyphs but also graphemes. I guess
radicals also belong to the category of graphemes. The problem is that the first
digit from the running example by @peter.ludemann is a radical:

二: Example composites that contain the radical 貳、于、云、些
https://en.wikipedia.org/wiki/Kangxi_radical#Table_of_radicals

These smaller units can be used for decomposition of larger units. But some such
composition is also found in the non-Unihan part of Unicode, so I wonder what
the big difference is, and whether it's only some packaging of Unicode into two parts?

Edit 24.06.2022:
There is also an ISO standard that runs parallel to Unicode. It's freely downloadable:

ISO/IEC 10646 - Universal coded character set (UCS)
https://standards.iso.org/ittf/PubliclyAvailableStandards/c076835_ISO_IEC_10646_2020(E).zip

In table 1 of ISO/IEC 10646 you can also see that “letter” is only a subset
of the assigned glyphs and the other things that also get assigned.

While I escaped from the encoding problem, I now feel ashamed to see that European Prolog users have much more precise knowledge about CJK encoding than me, who should be the native one. I have to pick out a book on my bookshelf by an expert (Keisuke Yano) on character encoding. I hope I will catch up with what is discussed here. Thanks.

Sorry to disappoint you, but it is not (NOT) discussed here. You
are still misunderstanding. UAX 44 does not (DOES NOT) cover
CJK (Chinese, Japanese and Korean). It's out of scope.

@peter.ludemann introduced oranges, i.e. CJK, and you
@kuniaki.mukai picked up the topic and started talking about
oranges, i.e. CJK. On the other hand Jan W. and myself
were only considering apples, i.e. non-CJK, no oranges.

Corollary: Since we were only talking about apples, we
possibly have no knowledge of oranges. Most likely only
superficial knowledge from reading some Wikipedia pages.

Edit 24.06.2022:
@kuniaki.mukai, OK, it's there! Please download the daily build
swipl-w64-2022-06-24.exe; I now get the following results. This
is all apples and no oranges, no CJK involved:

/* Welcome to SWI-Prolog (threaded, 64 bits, version 8.5.13-8-gb920051a7) */
?- A = ४२.
A = 42.

?- X = 20306.
X = 20306.

If you try oranges on the other hand, i.e. CJK, it still refuses;
no number conversion is done:

?- X = 二.
X = 二.

Here is a screenshot, no bluffing, it's all there now; the build has stamp gb920051a7.
I was using a Windows 10 machine for testing.

The problem is that the common term Unicode Database usually only refers to
UAX 44, and Jan W. also refers to the Unicode Database. If you would like to
include CJK, you would need to also include the Unihan Unicode Database, which
is UAX 38. Maybe you can find people that want to discuss that? You could open
a new thread? Or continue in this thread? Not sure what the best solution is.
I would rather restrict the scope of this thread to UAX 44 and not include
UAX 38, although CJK is of course also a fascinating topic.

FYI, SWISH has been updated, so all the stuff discussed can be tested and demonstrated there. As far as I’m concerned, this is for now the end of this. Whether and where support for non-ASCII numerals must be included should be discussed in a broader context. Adding support for non-decimal systems is far too much work for something that first needs some proof of its value.

Note that editor syntax highlighting in SWISH does not work properly as the SWISH Prolog tokenizer in JavaScript has not been updated.


I have skimmed the book (ISBN 978-4-7741-4164-0) in Japanese and found a boxed page (p. 145) on the Unicode character database, which explains a case in which a table is prepared as an interface for using the encoded database content; I also came across some lines somewhere about Unihan.

I also remember that some Japanese groups working on TeX/LaTeX have been so active that platex, uplatex, xelatex and so on are now included in the TeX Live / MacTeX distributions. One contributor among many others is a researcher on classic/archival documents written in old Japanese characters which are not fully covered even by UTF-8.

Also, we use a smart Unix command line tool, NKF, to convert the character encoding of files, to which, I heard, a Japanese developer made a central contribution. Thanks to them, I could have an almost easy life w.r.t. character encoding by using UTF-8 for everything. Once upon a time we used to have a hard time juggling EUC and JIS encodings, but now we have almost forgotten such troubles.

The standard text in English for Chinese/Japanese/Korean/Vietnamese processing is CJKV Information Processing, 2nd Edition.

It’s a bit obsolete (and I’ve lost my copy) now that Unicode has supplanted EUC, JIS, Shift-JIS, Big5, etc. (good-bye 文字化け moji-bake) but was essential reading for anyone who had to deal with CJK text ~25 years ago. (There still remain plenty of gotchas, though, such as 全角・半角.)

Anyway, I wouldn’t expect kanji numbers to be treated the same as ASCII digits, nor would I expect 令和4年6月24日 to be automatically translated to 2022-06-24. And I doubt that anyone would care if wide digits (２０２２−０６−２４) need to be quoted.
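
The era arithmetic itself is of course trivial to do explicitly if an application wants it; a toy sketch (my own naming, not any library predicate):

% Toy sketch: Reiwa year N is Gregorian year 2018 + N (Reiwa 1 = 2019).
reiwa_date(reiwa(Y, M, D), date(GY, M, D)) :-
    GY is 2018 + Y.

?- reiwa_date(reiwa(4, 6, 24), D).
D = date(2022, 6, 24).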

According to Wikipedia, Shift JIS is now down to “6.0% of sites in the .jp domain. UTF-8 is used by 93.8% of Japanese websites.” So, it’s just a ghost of its former self.

I agree. But office workers might think differently, because they may now be busy moving toward paperless offices.

BTW, recently researchers in India expressed thanks to Japanese temples, which still use the original Sanskrit script (梵字?) for some purposes. That script was lost in India. One Japanese tradition, for good or bad, is to keep imported things as they are and to revise them only a little, out of respect. I heard that many of the Chinese Kanji words for modern concepts have been coined by the Japanese (60%?) since the Meiji era (1870).

I hope this perhaps-good tradition will not be an obstacle preventing the shift to a paperless society. This is only my opinion; of course I have never talked about it with others, who may think differently.

It's not necessary to convert anything to UTF-8 if you have only a single
script aka language inside your text, since most browsers understand, for
example, Shift-JIS. If you encode this here into a Shift-JIS file:

<html>
   <head>
       <meta http-equiv="content-type" content="text/html; charset=Shift_JIS">
       <title></title>
   </head>
   <body>
     幽霊です
  </body>
</html>

It shows nicely in the browser. Although the browser's internal character set
is usually Unicode (in particular, the browser's JavaScript strings are UTF-16),
it nevertheless displays, since the charset is a property of the individual file.


That's the file I used:
test.html.log (243 Bytes)

Edit 24.06.2022:
Externally you might also prefer UTF-16 over UTF-8, since UTF-16 is
“Bad for English text, good for Asian text.”


The source of this claim (I would rather prefer some scientific statistics, but anyway):

XML 1.1 Bible, 3rd Edition
https://www.wiley.com/en-us/XML+1+1+Bible%2C+3rd+Edition-p-9780764549861

I prefer UTF-8 to UTF-16 without such statistics. Perhaps I do so because Apple supports UTF-8.

BTW the kanji 熊 (bear) is a trauma for me. About 30 years ago I spent a long time debugging Prolog code which simply read a file, and finally found that the kanji's byte sequence contained an ASCII control code which stopped the program. It was EUC or Shift-JIS encoding, I don't remember which. I had never thought such errors could be caused by text encoding. Then I heard that UTF-8 has no such problem, which is a simple but main reason that I prefer UTF-8.

Maybe you mean the locale and not the file format?
A UTF-8 based locale is again something else. You can
have a UTF-8 based locale and yet use a variety of
file formats on most if not all computers. This seems to be
applicable to the Mac as well; you can store UTF-16 or Shift-JIS.

UTF-16 and UTF-8 store the same Unicode code points; you
won't see any difference, only a different file size. Try it!

Disclaimer: I don’t know your tool chains, so my advice might
not work, be careful, make backups before converting stuff.

Edit 24.06.2022:
Here is a test: the plain text from the Japanese Wikipedia page on the Prolog
programming language. The size reduction doesn't work as well for the
HTML; that's why UTF-8 might be prevalent on the web.

But for the plain text one quickly sees a size reduction. The Wikipedia page
has also some latin text in it. Maybe a page with only Japanese shows
a higher compression. Here are the results:

UTF-8 encoded: 38’191 Bytes
UTF-16 encoded: 29’392 Bytes

These are the files I was using:

prolog.jp.utf8.txt.log (37,3 KB)

prolog.jp.utf16.txt.log (28,7 KB)
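
If you want to reproduce such a measurement, a minimal sketch along these lines should do (my own code, untested; predicate and file names are just examples):

% Write the same text once as UTF-8 and once as UTF-16LE and compare file sizes.
compare_sizes(Text) :-
    encoded_size(Text, utf8, 'sample.utf8.txt', S8),
    encoded_size(Text, unicode_le, 'sample.utf16.txt', S16),
    format('UTF-8: ~d bytes, UTF-16: ~d bytes~n', [S8, S16]).

encoded_size(Text, Enc, File, Size) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(Enc)]),
        write(Out, Text),
        close(Out)),
    size_file(File, Size).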

Thanks for the useful comments. By saying that Apple supports UTF-8, I meant that Apple's default encoding is UTF-8. By “default encoding” I meant that Apple products, e.g. TextEdit.app, accept UTF-8 files, but I suspect they do not accept a UTF-16 file without showing a dialog for changing the encoding, though I have not tested this; it is based on my experience on macOS. Unfortunately, I have no occasion to touch Windows. Also, by default any text files are in UTF-8, including on iOS, iPadOS, and macOS. I guess it is Apple's policy for its recent universal access functionality.

In general, it’s a mistake to depend on language features for normalizing data (even when setting the localisation environment). I would expect a Japanese-oriented application to normalize full-width numbers to ASCII (e.g., １２３ would be normalized to 123) and also to normalize 半角 (han-kaku) to 全角 (zen-kaku) (e.g. ﾊﾝｶｸ to ハンカク), and probably a lot of other normalizations. Presumably there are libraries for handling this. (Normalization rules for other languages have their own nuances; for example Québec and France have slightly different rules for capitalizing a name with accents, such as “Lenôtre”.)
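
For the digit part alone, a hand-rolled sketch could look like this (my own helper; a real application would presumably use a proper NFKC normalization library instead):

% Minimal sketch: fold full-width digits U+FF10..U+FF19 to ASCII '0'..'9'.
normalize_fullwidth_digits(In, Out) :-
    string_codes(In, Codes0),
    maplist(fold_digit, Codes0, Codes),
    string_codes(Out, Codes).

fold_digit(C0, C) :-
    (   between(0xFF10, 0xFF19, C0)
    ->  C is C0 - 0xFF10 + 0'0
    ;   C = C0
    ).

?- normalize_fullwidth_digits("１２３", S).
S = "123".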

I fully agree with your suggestion. In fact I have been practicing in that way for a long time, and I believe most people do so as well. For example, it was widely advised by network administrators not to use half-width characters in mail subject lines, because of the high risk of trouble when sending/receiving such sequences of characters.
There might be many R & D topics related to encoding which are still really active. However, I escaped from such important and heavy issues; I hope so, as I was close to them in the sense of “social distance”. Character encoding did not look attractive to me, but now it may be a more important technology than I thought. Metafont technology might concern glyphs, for example.

Nope, it will accept a UTF-16 file without any manipulation, and you
don't need to change the encoding to UTF-8. This works especially
well in case the UTF-16 file has a BOM as its first character. The BOM
is then used to detect the particular type of encoding, for example the
endianness. The BOM should also be supported by SWI-Prolog, from within
the open/4 predicate. There are a couple of other Prolog systems that
support it as well. See also here:

2.19.1.1 BOM: Byte Order Mark
BOM or Byte Order Marker is a technique for identifying
Unicode text files as well as the encoding they use.
SWI-Prolog -- BOM: Byte Order Mark

bom(Bool)
the default is true for mode read
https://www.swi-prolog.org/pldoc/doc_for?object=open/4
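
For example, a little sketch of my own (untested): open the file with some default encoding and let the BOM, if present, override it, then ask the stream what it ended up with:

% bom(true) is the default in read mode, so a detected BOM overrides
% the requested encoding; stream_property/2 shows the result.
detected_encoding(File, Encoding) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8)]),
        stream_property(In, encoding(Encoding)),
        close(In)).

?- detected_encoding('some_utf16le_file.txt', E).
% expected: E = unicode_le, if the file starts with FF FE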

Disclaimer: Of course you might run into problems if some tool in your tool
chain cannot detect, handle or generate BOMs, so be careful.

Edit 25.06.2022:
The BOM itself is just a Unicode code point that gets different
byte patterns depending on the encoding. The code point even has
an entry in the Unicode database:

Zero Width No-Break Space (BOM, ZWNBSP)
https://www.compart.com/de/unicode/U+FEFF

I even read now:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

So you might find a BOM also for UTF-8 files, not only for UTF-16 and other encodings.
The usual heuristic is to choose UTF-8 if there is no BOM, but the above suggests
throwing an error if there is no BOM and the file contains something non-ASCII, i.e. > 7 bits.

Don't know if any Prolog system throws such an error. One could make the
input stream aware that there was no BOM, and then bark on a > 7-bit character
coming from the stream. That could give more security.
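
A crude approximation of that idea, as a sketch of my own (untested), would be to check the raw bytes up front instead of instrumenting the stream:

% Succeed only if File starts with a BOM or contains nothing but 7-bit bytes.
ascii_or_bom(File) :-
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        read_string(In, _Length, Data),
        close(In)),
    string_codes(Data, Bytes),
    (   bom_prefix(Bytes)
    ->  true
    ;   forall(member(B, Bytes), B =< 0x7F)
    ).

bom_prefix([0xEF, 0xBB, 0xBF|_]).   % UTF-8 BOM
bom_prefix([0xFE, 0xFF|_]).         % UTF-16BE BOM
bom_prefix([0xFF, 0xFE|_]).         % UTF-16LE BOM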

Makes me curious to check what Java, JavaScript and Python do, and whether
they have some reader object with such a feature. In the worst case one could
realize this feature in the Prolog system's streams itself.

Again, bom(Bool) is not mentioned in the ISO Prolog core standard, but
it already has some support across Prolog systems. For example in
formerly Jekejeke Prolog I have the same; not sure what newer Prolog
systems such as Scryer Prolog provide. Need to check.

Edit 25.06.2022:
But I don't know what the Mac, i.e. Apple, does. The above is from Windows.
This guy here also discusses the issue at length for the Mac, i.e. Apple.

Source:

Step into Xcode: MAC OS X Development
https://www.amazon.com/dp/0321334221

TextEdit seems to open both files, UTF-8 and UTF-16, without warnings, as you said. Thanks.

% echo  漢字 > a 
% nkf -w16 a > b
% open -a TextEdit a
% open -a TextEdit b

Do you see what you want? I see something without meaning :sweat_smile:

% hexdump a
0000000 bce6 e5a2 97ad 000a                    
0000007

% hexdump b
0000000 fffe 226f 575b 0a00                    
0000008
%

Just look at the Wikipedia table linked below and the screenshot. Now we find:

File a: Has no BOM, you created it with echo.

File b: Has a BOM, FF FE, that's the UTF-16LE BOM.

(Note that hexdump's default output shows 16-bit words in host byte order, which is little-endian here, so the byte pairs appear swapped; use hexdump -C to see the bytes in file order.)

https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding


Roughly, that would require making something like libiconv or librecode part of the system. While this is typically installed already on most Unix like systems, it is a big new dependency, about as big as SWI-Prolog itself :frowning:

When I added Unicode support I went for built-in support for ISO Latin 1, UTF-8 and UCS-2 (big and little endian). In addition there is the text encoding, which uses whatever the locale tells the multibyte routines of the C runtime to support. So, you always have access to files in your default locale, UTF-8 and UCS-2 files. As fairly all non-Windows systems already use UTF-8 as their primary encoding, this seems the right direction. Windows seems to be moving in the same direction.

At some point UCS-2 should be extended to UTF-16, i.e. we must support surrogate pairs. As is, this is not that useful, as we find UTF-16 mostly on Windows and SWI-Prolog does not support code points > 0xffff on Windows.
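
For what it is worth, the surrogate pair arithmetic itself is small; a sketch:

% Split a code point above 0xFFFF into a UTF-16 surrogate pair.
surrogate_pair(Code, High, Low) :-
    Code > 0xFFFF,
    C is Code - 0x10000,
    High is 0xD800 + (C >> 10),
    Low  is 0xDC00 + (C /\ 0x3FF).

?- surrogate_pair(0x1F600, H, L), format("~16r ~16r~n", [H, L]).
d83d de00
H = 55357,
L = 56832.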