Issue with processing Unicode

Hello,

I am running into problems processing Unicode from this Chinese dictionary: https://handedict.zydeo.net/de/download (handedict.u8.gz) in this predicate:

kmp_isub(T1,T2,Simil) :-
    catch( % some invalid Unicode characters occur !?
        (
            downcase_atom(T1,S1),
            downcase_atom(T2,S2),
            isub(S1,S2,Simil,[zero_to_one(true)])
        ), 
        E,
        (
            writeln(E),
            zhde(Zh,T2),
            writeln(Zh),
            Simil = 0.0
        )
    ),
    Simil > 0.5 .
Thread 1 (main): foreign predicate downcase_atom/2 did not clear exception: 
        error(representation_error(code_point),context(system:downcase_atom/2,_4458584))
error(instantiation_error,context(isub: $isub/5,_4458638))
〥 〥 [wu3] 
Thread 1 (main): foreign predicate downcase_atom/2 did not clear exception: 
        error(representation_error(code_point),context(system:downcase_atom/2,_4466104))
error(instantiation_error,context(isub: $isub/5,_4466158))
〤 〤 [si4] 

I suppose the issue might be within isub, as downcase_atom on one of those lines works fine:

?- downcase_atom('〥 〥 [wu3] /5 (Num) (im Suzhou Zahlenystem 蘇州碼子|苏州码子[su1 zhou1 ma3 zi5])/',A).
A = '〥 〥 [wu3] /5 (num) (im suzhou zahlenystem 蘇州碼子|苏州码子[su1 zhou1 ma3 zi5])/'.
echo '〥 〥 [wu3] /5 (Num) (im Suzhou Zahlenystem 蘇州碼子|苏州码子[su1 zhou1 ma3 zi5])/'|hexdump -C
00000000  e3 80 a5 20 e3 80 a5 20  5b 77 75 33 5d 20 2f 35  |... ... [wu3] /5|
00000010  20 28 4e 75 6d 29 20 28  69 6d 20 53 75 7a 68 6f  | (Num) (im Suzho|
00000020  75 20 5a 61 68 6c 65 6e  79 73 74 65 6d 20 e8 98  |u Zahlenystem ..|
00000030  87 e5 b7 9e e7 a2 bc e5  ad 90 7c e8 8b 8f e5 b7  |..........|.....|
00000040  9e e7 a0 81 e5 ad 90 5b  73 75 31 20 7a 68 6f 75  |.......[su1 zhou|
00000050  31 20 6d 61 33 20 7a 69  35 5d 29 2f 0a           |1 ma3 zi5])/.|

Guess this is on Windows? A quick test reveals that downcase_atom/2 is not updated to handle UTF-16 and its error handling is broken. I’ll have a look.


It is Ubuntu 20.04:

$ swipl --version
SWI-Prolog version 9.0.4 for x86_64-linux

Thanks. A closer look tells me UTF-16 is handled properly (and is irrelevant on Ubuntu). I pushed 090b6216128d7097267790c289ea8c87d23605fd to swipl-devel.git to address the error propagation. A Unicode code point representation error is rather weird on anything but Windows. Using the above, I ran p/0 from this code:

p :-
    setup_call_cleanup(
        gzopen('handedict.u8.gz', read, In),
        process(In),
        close(In)).

process(In) :-
    read_line_to_string(In, Line),
    process(Line, In).

process(end_of_file, _) :-
    !.
process(Line, In) :-
    downcase_atom(Line, _),
    read_line_to_string(In, Line2),
    process(Line2, In).

That runs just fine. Note that downcase_atom/2 uses the C runtime function towlower() which depends on the current locale.

Ideally, compile the current development version from source. If that is too much for you, you could also try the development PPA. If the problem persists there, please share the program so I can run the same.

So far, I don’t think it is isub/4 because the ignored exception comes from downcase_atom/2.


OK, investigating a bit more (I removed all comments and empty lines from the original file and read it with csv_read_file, using ‘/’ as the separator): this works fine for all lines except those where German and Chinese characters are mixed in the second column…

In fact, the encoding of 〩 〩 [jiu3] /9 (Num) (im Suzhou Zahlensystem 蘇州碼子|苏州码子[su1 zhou1 ma3 zi5])/ looks a bit weird and also gives an error with writeln:

hande('〩 〩 [jiu3] ',D,E).
D = '\U4E282039\U20296D75\U206D6928\U687A7553\U5A20756F\U656C6861\U7379736E\U206D6574\uE58798E8\uA2E79EB7\u90ADE5BC\u8F8BE87C\uE79EB7E5\uADE581A0\U75735B90\U687A2031\U2031756F\U2033616D\U5D35697A\U69650029\U696A2031\U69782032\U32676E61\U34757720\U69687320\U697A2031\U00205D33\U6F616863\U00205D32啞\U0048218D\u0000\U00037905\u0000\uD7772950啞\U0048318D\u0000\U00037985\u0000\uD7772DA0啞\U0048408D\u0000\U00037A05\u0000\uD7772EB0啞\U0048528D\u0000\U00037A85\u0000𗮅\u0000\uD7773180啞\U0048610D\u0000\U00037B05\u0000犅\u0000\U0007C10D\u0000隅\u0000\U0048708D\u0000\U00037B85\u0000龅\u0000\U0048808D\u0000\uD7773690啞\U0048910D',
E = ''.

It seems the encoding issue comes from reading the file as CSV. I put a small test here: https://github.com/revuloj/voko-efemero/blob/405b59895eb2d95a61cf6f4a5f1e70beb1e472a2/pro/zhtest.pl

Indeed looks weird, but it is hard to say without knowing what hande/3 does … The input text seems represented fine:

?- atom_codes('〩 〩 [jiu3] ', L).
L = [12329, 32, 12329, 32, 91, 106, 105, 117, 51|...].

Please also add the output of running locale in the Ubuntu shell.

The locale is UTF-8:

locale
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

Thanks, but would you mind sharing hande/3 and/or test.csv? You gave a lot of incomplete code. Life is much easier with something I can simply load and run that is as close as possible to what you have.


See the link to the test script on GitHub in my previous message… The encoding issue seems to happen somewhere here while reading the CSV:

 Exit: (26) csv:field_codes([57, 32, 40, 78, 117, 109, 41|...], 47, [57, 32, 40, 78, 117, 109, 41|...], [47, 10|_708{pure_input = ...}]) ? creep
   Call: (26) csv:make_value([57, 32, 40, 78, 117, 109, 41|...], _764, csv_options(47, false, true, true, preserve, row, _86, false, #)) ? skip
   Exit: (26) csv:make_value([57, 32, 40, 78, 117, 109, 41|...], '\U4E282039\U20296D75\U206D6928\U687A7553\U5A20756F\U656C6861\U7379736E\U206D6574\uE58798E8\uA2E79EB7\u90ADE5BC\u8F8BE87C\uE79EB7E5\uADE581A0\U75735B90\U687A2031\U2031756F\U2033616D\U5D35697A)\u0000\u0000\a\u0000\U3BFFD1FE翹\U000B6F85\u0000\U3BFFD427翹ƃ\u0000\U3BFFD1FE翹\U00014185\u0000\U3BFFCCE0翹A\u0000\uFC13E760嘊\uFC13E5C0嘊 \u0000 \u0000\u9CBC4E1E嘏\uFC147C60嘊Ā\u0000\u0080\u0000\u9CBC237E嘏Ὶ\u0000”\u0000\u0003\u0003\u0003\u0000\u0000\u0000\u0000\u0000\a\u0000\U3BFFD1FE翹\U00035185\u0000\U3BFFD427', csv_options(47, false, true, true, preserve, row, _86, false, #)) ? 

Kind regards,
Wolfram.

Hello Jan,

OK, I found it: I had to add convert(false) to the CSV options. The field in this case starts with a ‘9’, and csv_read_file tries to convert the whole string to a number using name/2.

So it seems it was a simple bug in my code. Still, I am not sure a failed conversion should silently create a wrongly encoded atom…
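
A minimal sketch of that workaround, assuming the input file has already been stripped of comments and empty lines and uses ‘/’ as the field separator. The helper name read_dict_rows/2 is hypothetical; csv_read_file/3 and its separator/1 and convert/1 options come from library(csv).

```prolog
:- use_module(library(csv)).

% read_dict_rows(+File, -Rows): hypothetical helper, not part of library(csv).
% convert(false) keeps every field as an atom, so a field that merely
% starts with a digit (such as '9 (Num) ...') is never handed to name/2
% for number conversion.
read_dict_rows(File, Rows) :-
    csv_read_file(File, Rows,
                  [ separator(0'/),   % fields are separated by '/'
                    convert(false)    % do not try to turn fields into numbers
                  ]).
```

With convert(false), numeric fields also stay atoms, so any field that really is a number would need an explicit conversion afterwards.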

Kind regards,
Wolfram.

That is only the script. The file zhtest.csv is missing.

It looks likely that you stumbled on some bug. As I can’t reproduce your results, it is hard to guess what is going wrong. The original result mentions a code point representation error. On Windows that could be some UTF-16 issue, but on other systems it means something created a code point outside the Unicode range. That means either some conversion is broken or somehow we process garbage. That would be a serious bug.

But I need the Prolog code as well as the data to reproduce this. Of course you do not need to share all your code, but the code must be complete in the sense that it does not depend on anything that is not included, so I can actually run it.
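
One way to check whether an atom holds code points outside the Unicode range is sketched below. The predicate name valid_code_points/1 is hypothetical, and the sketch assumes that atom_codes/2 reports the raw code of each character of a corrupted atom, which may not hold on every version.

```prolog
:- use_module(library(lists)).

% valid_code_points(+Atom): hypothetical helper.
% Succeeds if every character code of Atom lies in 0..0x10FFFF,
% the range of valid Unicode code points.
valid_code_points(Atom) :-
    atom_codes(Atom, Codes),
    forall(member(Code, Codes),
           between(0, 0x10FFFF, Code)).
```

For well-formed text such as '〩 〩 [jiu3] ' this should succeed; it is only meant as a quick way to spot atoms that were built from garbage.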

Hi Jan,

The testfile is in the same directory:

But I have now tracked it down to:

?- atom_codes('9 (Num) (im Suzhou Zahlensystem 蘇州碼子|苏州码子[su1 zhou1 ma3 zi5]',C),name(N,C).
C = [57, 32, 40, 78, 117, 109, 41, 32, 40|...],
N = '\U4E282039\U20296D75\U206D6928\U687A7553\U5A20756F\U656C6861\U7379736E\U206D6574\uE58798E8\uA2E79EB7\u90ADE5BC\u8F8BE87C\uE79EB7E5\uADE581A0\U75735B90\U687A2031\U2031756F\U2033616D\U5D35697A\U20317500\U2033616D\U5D35697A\U29432C27\U6D616E2C\U2C4E2865\U002E2943\U54344158P\u8CEB38F0嘌\U0048218D\u0000\U00037905\u0000\u8CEB3A00嘌\U0048318D\u0000\U00037985\u0000\u8CEB3E50嘌\U0048408D\u0000\U00037A05\u0000\u8CEB3F60嘌\U0048528D\u0000\U00037A85\u0000𗮅\u0000\u8CEB4230嘌\U0048610D\u0000\U00037B05\u0000犅\u0000\U0007C10D\u0000隅\u0000\U0048708D\u0000\U00037B85\u0000龅\u0000\U0048808D\u0000\u8CEB4740嘌'.

A really weird Easter egg from Edinburgh…

Regards,
Wolfram.

That is what we want to see :-) I can reproduce this, and it is clearly wrong. N should be the input atom.

Ok. Pushed a fix that also fixes some issues wrt code lists holding code 0 and handling of high (>0xffff) code points on Windows for name/2.


Cool, that was fast. It has been a long time since I was able to write working C code…
Wish you a good rest of the week.