Trouble with Unicode

rvjansen · September 21, 2021, 11:19am

I’m using: SWI-Prolog version 8.2.4

I want the code to: work

But what I’m getting is: lots of errors: Operator Expected

My code looks like this:

'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9818360-2303-11E3-ACBE-E0F847277696',capaç).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E982BBE0-2303-11E3-ACBE-E0F847277696',competent).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9844280-2303-11E3-ACBE-E0F847277696',entitat).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9857B00-2303-11E3-ACBE-E0F847277696',alenar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E986DA90-2303-11E3-ACBE-E0F847277696',espirar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9881310-2303-11E3-ACBE-E0F847277696',respirar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9892480-2303-11E3-ACBE-E0F847277696',dC).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98A5D00-2303-11E3-ACBE-E0F847277696','DC').
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98BBC90-2303-11E3-ACBE-E0F847277696',impossibilitat).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98D9150-2303-11E3-ACBE-E0F847277696',incapaç).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98EA2C0-2303-11E3-ACBE-E0F847277696',abstracció).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9907780-2303-11E3-ACBE-E0F847277696',abaxial).

This is from a mothballed dictionary project that I decided to un-mothball. Due to horrible install problems I migrated from M1 Mac to Ubuntu 20.04; I installed from the recommended SWI repo.

This (very large now, 808M) file suddenly spews errors. I used this, if I remember well, in the SWI-Prolog 5 and 6 era. It was fine then, on the Mac. I do a qcompile of it, which speeds things up immeasurably. While doing this, lots and lots and lots of errors show up (one example only):

ERROR: /home/rvjansen/papiamento/src/facts.pl:6285725:81: Syntax error: Operator expected

Looking at those lines with awk, the conclusion presents itself as Unicode characters not being supported (anymore) as parts of an atom. Can anybody help?

best regards,

René Jansen.

Boris · September 21, 2021, 11:32am

Maybe there are problems with the encoding? Are the files you have in UTF-8? Can you actually just get this one line that has the error you show, put it in a file on its own, and reproduce the error? That would make it easier to troubleshoot.

Because Unicode source is definitely fully supported. If I just copy the snippet as you showed above and paste it into a file, it loads (compiles) without any issues.

rvjansen · September 21, 2021, 11:43am

well, that was it. But to minimalise it even more:

'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98D9150-2303-11E3-ACBE-E0F847277696',incapaç).

and look at that last word.

Boris · September 21, 2021, 11:47am

It works for me when I copy-paste to a file. Here is what I see with listing:

'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9818360-2303-11E3-ACBE-E0F847277696', capaç).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E982BBE0-2303-11E3-ACBE-E0F847277696', competent).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9844280-2303-11E3-ACBE-E0F847277696', entitat).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9857B00-2303-11E3-ACBE-E0F847277696', alenar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E986DA90-2303-11E3-ACBE-E0F847277696', espirar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9881310-2303-11E3-ACBE-E0F847277696', respirar).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9892480-2303-11E3-ACBE-E0F847277696', dC).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98A5D00-2303-11E3-ACBE-E0F847277696', 'DC').
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98BBC90-2303-11E3-ACBE-E0F847277696', impossibilitat).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98D9150-2303-11E3-ACBE-E0F847277696', incapaç).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E98EA2C0-2303-11E3-ACBE-E0F847277696', abstracció).
'ADFD9D7F-A05F-4DB3-8B40-C7813D3900DC'('E9907780-2303-11E3-ACBE-E0F847277696', abaxial).

I now copied your snippet, pasted it to a file, saved this file, then loaded it and used listing, then copied the output back in here. Can you check the encoding somehow?

EDIT: if I just copy-paste incapaç to a file called bar, I see:

$ cat bar 
incapaç
$ hexdump bar
0000000 69 6e 63 61 70 61 c3 a7 0a                     
0000009

The ç character is the c3 a7 in here.
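For anyone who wants to repeat this check without relying on copy-paste (which can itself change the encoding), here is a minimal sketch that writes the word as explicit UTF-8 bytes and dumps them. The file name bar follows the example above; using od instead of hexdump is my own choice, because od -tx1 prints one byte per column on any POSIX system.

```shell
# Write the word as explicit UTF-8 bytes; octal \303\247 is hex c3 a7, i.e. ç.
printf 'incapa\303\247\n' > bar

# Dump one byte per column (-An drops the offset column).
od -An -tx1 bar

# The same bytes as one compact string, handy for comparing two files.
bytes=$(od -An -tx1 bar | tr -d ' \n')
echo "$bytes"   # prints 696e63617061c3a70a
```

If the suspect line from the real file shows c3 a7 for the ç as well, the file is UTF-8 and the problem is on the reading side.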

rvjansen · September 21, 2021, 11:55am

Thanks Boris, will have a look at the encoding, will let you know.

rvjansen · September 21, 2021, 12:06pm

yes, this is how it should be, U+00E7 ç c3 a7 LATIN SMALL LETTER C WITH CEDILLA
and this is how the line shows on my system:

87654321  0011 2233 4455 6677 8899 aabb ccdd eeff  0123456789abcdef
00000000: 2741 4446 4439 4437 462d 4130 3546 2d34  'ADFD9D7F-A05F-4
00000010: 4442 332d 3842 3430 2d43 3738 3133 4433  DB3-8B40-C7813D3
00000020: 3930 3044 4327 2827 4539 3844 3931 3530  900DC'('E98D9150
00000030: 2d32 3330 332d 3131 4533 2d41 4342 452d  -2303-11E3-ACBE-
00000040: 4530 4638 3437 3237 3736 3936 272c 696e  E0F847277696',in
00000050: 6361 7061 c3a7 292e 0a                   capa..)..

If UTF-8 and Unicode are supported, and this works on your system, there must be something wrong with this Ubuntu (on an Intel NUC-7, twice upgraded 16->18, 18->20). Back to my Mac.

jan · September 21, 2021, 12:16pm

The encoding of the file must match the default locale, or the file must contain a (correct) :- encoding(...). directive at the start. It looks like the file has UTF-8 encoding. Typically, modern Linux systems also default to a UTF-8 based locale. Maybe not in this case? Try running locale in the shell. That should say something like LANG=…UTF-8. Prolog should also answer this query as follows:

?- current_prolog_flag(encoding, X).
X = utf8.

If the locale is wrong, install the desired locale and set your LANG environment variable correctly.
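A rough sketch of that check and fix in the shell. The locale name en_US.UTF-8 is only an example and an assumption on my part; use whichever UTF-8 locale locale -a actually lists on your machine.

```shell
# Print the active locale settings; LANG or LC_CTYPE should end in UTF-8.
locale

# Print just the character encoding of the current locale.
locale charmap

# List installed UTF-8 locales (prints a note if none are installed).
locale -a | grep -i 'utf' || echo "no UTF-8 locale installed"

# On Debian/Ubuntu a missing locale can usually be generated with:
#   sudo locale-gen en_US.UTF-8

# Select a UTF-8 locale for this shell session (example name, see above).
export LANG=en_US.UTF-8
```

To make the setting permanent, the export line would go into the shell's startup file. After starting Prolog from that shell, the current_prolog_flag(encoding, X) query above should report utf8.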

Alternatively, you can use the recode utility to recode the file to the current locale. These days you typically want your environment to use UTF-8 everywhere, though.
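If changing the locale is not an option, the :- encoding(...). directive mentioned at the top of this post can be prepended to the file without opening an 808M file in an editor. A minimal sketch; the file names facts.pl and facts_utf8.pl and the one-clause stand-in content are hypothetical.

```shell
# Stand-in for the real source file (hypothetical name and content);
# octal \303\247 is the UTF-8 encoding of ç.
printf "w('incapa\303\247').\n" > facts.pl

# Write the directive first, then the original clauses, into a new file.
{ printf ':- encoding(utf8).\n'; cat facts.pl; } > facts_utf8.pl

# The directive must be the first thing in the file.
head -n 1 facts_utf8.pl   # prints :- encoding(utf8).
```

Writing to a new file rather than editing in place keeps the original untouched until the new one has been loaded successfully.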

rvjansen · September 21, 2021, 12:18pm

aaargh! yes, that is it.

?- current_prolog_flag(encoding,X).
X = iso_latin_1.

Thanks, Jan and Boris!

jan · September 21, 2021, 12:22pm

That points to a wrong locale. The encoding flag is initialized from LC_CTYPE or LANG. If this ends with UTF-8, utf8 is assumed. If it is a known alias for ISO Latin 1, this is assumed, and otherwise it is set to text, which causes Prolog to fall back to the C runtime library to translate input to Unicode.
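To see what an ISO Latin 1 reader makes of UTF-8 input, the misreading can be reproduced in the shell. This is only an illustration using iconv (not something Prolog runs internally), and the file name word.txt is made up.

```shell
# The UTF-8 encoding of "incapaç" (octal \303\247 = hex c3 a7 = ç).
printf 'incapa\303\247\n' > word.txt

# Treat every byte as one ISO Latin 1 character and re-encode as UTF-8
# for display: c3 becomes Ã and a7 becomes §.
iconv -f ISO-8859-1 -t UTF-8 word.txt

# The same result as bytes: Ã is c3 83 and § is c2 a7.
misread=$(iconv -f ISO-8859-1 -t UTF-8 word.txt | od -An -tx1 | tr -d ' \n')
echo "$misread"   # prints 696e63617061c383c2a70a
```

With the flag at iso_latin_1, the word therefore ends in two characters instead of one ç, which plausibly is what the reader trips over when it reports "Operator expected".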

jan · September 21, 2021, 1:06pm

I compiled it on an M1. At least the current development version compiles fine after installing the dependencies using MacPorts. You can deduce the build process and dependencies from the Portfile. I was impressed by the performance. It is about twice as fast as the GCC-compiled version on my old MacBook Air. AFAIK you cannot yet use GCC on the M1. Once you can, I’m curious about the results. At least on Intel, GCC produces much faster code for SWI-Prolog than Clang.

I don’t know whether 8.2 compiles on the M1. It will be updated to 8.4 shortly anyway.

rvjansen · September 21, 2021, 1:24pm

I will certainly try that soon. As I use NetRexx with JPL, that last component is a bit of a bottleneck. Brew, for some reason, does not compile and install JPL; Ports fails with a build error on db53. The macOS package installer is x86_64 and works well with Rosetta; only calling from Java into that dylib fails with a ‘wrong architecture’ error, as Rosetta does not seem to support that (I have Zulu Java for ARM64). So I thought Linux would be faster, but I encountered an error there, which you thankfully diagnosed and which is fixed now. When 8.4 comes out, I’ll build that on the M1.
