Writing facts including unicode to file

tobi · March 12, 2021, 8:57am

Hello,

I’m trying to convert an XML file (odenet, a German WordNet-like resource, UTF-8 encoded) into a Prolog database.
I use load_xml/3 to read the file and convert the tree to facts (with assert), one fact per word entry.
Saving the facts (tell, listing, told) works, but reconsulting the file fails because some
word entries use non-ASCII characters (such as Greek letters or the phonetic alphabet).
I included the encoding(utf8) directive in my program as well as in the saved database.
Consulting fails because extra ticks (') show up in place of some UTF-8 encoded characters.

For example (from odenet entry “Tenor”):
“PHON:ˈteːnoːɐ̯” in the xml-file yields
'PHON:‘te\u02D0no\u02D0\u0250\u032F’ in the prolog database
^
this tick causes a syntax error when consulting the Prolog file

The interpreter shows no problems when calling those facts from memory (i.e. after asserting them),
perhaps something went wrong when writing those facts to the file? Does tell/told work with utf8?

Thank You

Boris · March 12, 2021, 9:19am

The discourse markdown processing is mangling your example. You need to use “code fences”, like this:

```
this will be printed verbatim
```

tobi · March 12, 2021, 11:16am

OK. I just wanted to abbreviate the XML/Prolog code: each XML entry, and so each Prolog fact, is quite large:

XML:

<LexicalEntry id="w4007" confidenceScore="1.0">
        <Lemma writtenForm="Tenor" partOfSpeech="n"/>
        <Sense id="w4007_811-n" synset="odenet-811-n" note="PHON:ˈteːnoːɐ̯"/>
        <Sense id="w4007_15099-n" synset="odenet-15099-n" note="PHON:teˈnoːɐ̯"/>
        <Sense id="w4007_34494-n" synset="odenet-34494-n" note="PHON:ˈteːnoːɐ̯"/>
</LexicalEntry>

The Prolog database saved to the file, which can’t be consulted, looks like this:

element('LexicalEntry', [id=w4007, confidenceScore='1.0'], 
           [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), 
            element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], []), 
            element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:te'no\u02D0\u0250\u032F'], []), 
            element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], [])]).

But when the XML file is asserted directly as Prolog facts, the in-memory database can be accessed correctly:

?- element('LexicalEntry', [id=w4007, confidenceScore='1.0'], SubTree).

SubTree = [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), 
           element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:ˈteːnoːɐ̯'], []), 
           element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:teˈnoːɐ̯'], []), 
           element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:ˈteːnoːɐ̯'], [])] ;

tobi · March 12, 2021, 12:50pm

ERROR: [Thread pce] c:.../odenet_database.pro:3999:192: Syntax error: Operator expected

and four other errors of the same kind (an additional tick in the UTF-8 string).

Looking at line 3999, column 192 (see the inline comment):

element('LexicalEntry', [id=w4007, confidenceScore='1.0'], 
           [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), 
            element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], []), 
            element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:te'no\u02D0\u0250\u032F'], []), 
                                                                                      % ^ this additional tick causes the syntax error
            element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], [])]).

This database includes 120,000 word entries; only five of them cause trouble.
So as a workaround I could correct them manually. But if odenet keeps growing, this is not a proper solution…

tobi · March 12, 2021, 3:46pm

This works when done manually.
So I need a clause that replaces a single quote with a double quote. That might be time-consuming for a 23 MB
file, though. OK, I have to find an elegant way to do this.
Thanks

tobi · March 12, 2021, 8:49pm

Thanks for taking all the trouble.
Btw: Windows 10, SWI-Prolog 8.2.4
I radically condensed the XML file. Suppose we have a file miniodenet.xml:
(the lines are very long, sorry…)

<?xml version="1.0" encoding="UTF-8"?>
<LexicalResource xmlns:dc="http://purl.org/dc/elements/1.1/" >
<Lexicon id="minimal_odenet" >
<LexicalEntry id="w8"><Lemma writtenForm="wegfahren" partOfSpeech="v"/><Sense id="w8_3-v" synset="odenet-3-v"></Sense></LexicalEntry>
<LexicalEntry id="w4007" confidenceScore="1.0"><Lemma writtenForm="Tenor" partOfSpeech="n"/><Sense id="w4007_811-n" synset="odenet-811-n" note="PHON:ˈteːnoːɐ̯"/><Sense id="w4007_15099-n" synset="odenet-15099-n" note="PHON:teˈnoːɐ̯"/><Sense id="w4007_34494-n" synset="odenet-34494-n" note="PHON:ˈteːnoːɐ̯"/></LexicalEntry>
<LexicalEntry id="w67853"><Lemma writtenForm="π" partOfSpeech="n"/><Sense id="w67853_18808-n" synset="odenet-18808-n"/></LexicalEntry>
</Lexicon>
</LexicalResource>

and a Prolog source file odenet.pro:

:- encoding(utf8).
:- dynamic element/3.

go:-
	load_xml('miniodenet.xml',Tree,[]),
	% extracting list of entries 
	[element(_,_,[_,element(_,_,EntryList),_])] = Tree, 
	maplist(to_fact,EntryList).
	
		
to_fact('\n'):- !.     % skipping newlines
to_fact(E):- assert(E).	

save:- tell('miniodenet.pro'), listing(element), told.

Executing prolog:

?- go.
?- save.
?- listing(element).

:- dynamic element/3.

element('LexicalEntry', [id=w8], [element('Lemma', [writtenForm=wegfahren, partOfSpeech=v], []), element('Sense', [id='w8_3-v', synset='odenet-3-v'], [])]).
element('LexicalEntry', [id=w4007, confidenceScore='1.0'], [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:ˈteːnoːɐ̯'], []), element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:teˈnoːɐ̯'], []), element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:ˈteːnoːɐ̯'], [])]).
element('LexicalEntry', [id=w67853], [element('Lemma', [writtenForm=π, partOfSpeech=n], []), element('Sense', [id='w67853_18808-n', synset='odenet-18808-n'], [])]).

Everything looks fine.
But the saved database in miniodenet.pro looks like this:

:- dynamic element/3.

element('LexicalEntry', [id=w8], [element('Lemma', [writtenForm=wegfahren, partOfSpeech=v], []), element('Sense', [id='w8_3-v', synset='odenet-3-v'], [])]).
element('LexicalEntry', [id=w4007, confidenceScore='1.0'], [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], []), element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:te'no\u02D0\u0250\u032F'], []), element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:'te\u02D0no\u02D0\u0250\u032F'], [])]).
element('LexicalEntry', [id=w67853], [element('Lemma', [writtenForm=p, partOfSpeech=n], []), element('Sense', [id='w67853_18808-n', synset='odenet-18808-n'], [])]).

Apparently Prolog writes the primary stress mark (ˈ, U+02C8, a phonetic symbol) to the file as a plain quote ('), and π comes out as p.

?- consult('miniodenet.pro').
ERROR: c:/......../prolog/miniodenet.pro:4:1: Syntax error: End of file in quoted atom

Boris · March 13, 2021, 6:14am

This doesn’t reproduce for me. The file miniodenet.pro looks fine, there are no escape sequences. But I see this:

?- current_prolog_flag(encoding, X).
X = utf8.

What does it say for you?

On the other hand locale/1 doesn’t give me anything too useful

?- current_locale(X).
X = default.
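
Beyond the global flag, it may also help to look at the encoding of the stream that tell/1 actually opens, since that is what decides how the characters end up in the file. A minimal sketch, assuming SWI-Prolog; tell_encoding/2 is a made-up helper name, not a library predicate:

```prolog
% Hypothetical helper (not from this thread): report the encoding of the
% stream that tell/1 opens. Note that this creates or truncates File.
tell_encoding(File, Encoding) :-
    setup_call_cleanup(
        tell(File),
        once(( current_output(Stream),
               stream_property(Stream, encoding(Encoding)) )),
        told).
```

Calling `?- tell_encoding('probe.tmp', E).` should bind E to an encoding name; on a setup where the problem shows up I would expect something other than utf8, but that is a guess.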

I wouldn’t have used tell/1 and so on, but still, with my setup (locale, encoding flag etc) the saved file looks fine:

$ cat miniodenet.pro 
:- dynamic element/3.

element('LexicalEntry', [id=w8], [element('Lemma', [writtenForm=wegfahren, partOfSpeech=v], []), element('Sense', [id='w8_3-v', synset='odenet-3-v'], [])]).
element('LexicalEntry', [id=w4007, confidenceScore='1.0'], [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:ˈteːnoːɐ̯'], []), element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:teˈnoːɐ̯'], []), element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:ˈteːnoːɐ̯'], [])]).
element('LexicalEntry', [id=w67853], [element('Lemma', [writtenForm=π, partOfSpeech=n], []), element('Sense', [id='w67853_18808-n', synset='odenet-18808-n'], [])]).

However, as @j4n_bur53 writes, the tick between “PHON:” and “te” apparently turns into a quotation mark for you (not for me, and @j4n_bur53 gets an error?).

But to resolve your current problem, I agree: try using ISO input/output instead of tell/1, and maybe try setting:

?- set_prolog_flag(encoding, utf8).

On a side note, I was under the impression that the :- encoding(utf8) directive that you use doesn’t affect file reading and writing, it affects only how the Prolog compiler reads the source file that contains that directive? Not sure.
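
If the stream encoding is indeed what matters, one way to take the locale out of the picture is to pass encoding(utf8) to open/4 for both writing and reading. A minimal sketch, assuming SWI-Prolog; write_utf8/2 and read_utf8/2 are made-up names for illustration:

```prolog
% Hypothetical helpers: write one quoted term to File and read it back,
% both through streams opened explicitly as UTF-8.
write_utf8(File, Term) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        write_term(Out, Term, [quoted(true), fullstop(true), nl(true)]),
        close(Out)).

read_utf8(File, Term) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8)]),
        read_term(In, Term, []),
        close(In)).
```

With this, `?- write_utf8('phon.pl', p('PHON:ˈteːnoːɐ̯')), read_utf8('phon.pl', T).` should give back the original term in T, provided the terminal passes the characters through unchanged.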

Boris · March 13, 2021, 4:19pm

No, I don’t. This almost certainly is caused by the OS locale, somehow. I don’t know how to do this on Windows. On the machine I am using right now, I can use the locale command and it tells me I am using “en_US.UTF-8”.

tobi · March 13, 2021, 9:37pm

Ok write_term/3 works for me:

On the toy example:

:- encoding(utf8).

go2:-
	load_xml('miniodenet.xml',Tree,[]),	
	% extracting list of entries 
	[element(_,_,[_,element(_,_,EntryList),_])] = Tree,     
	open('miniodenet.pro', write, Stream, []),
	write(Stream,':- encoding(utf8).\n'),
	maplist(to_stream(Stream),EntryList),
	close(Stream).

to_stream(_, '\n'):- !.     % skipping newlines
to_stream(Stream, Element):- write_term(Stream, Element, [fullstop(true), nl(true), quoted(true)]).

Executing:

?- go2.

content of miniodenet.pro:

:- encoding(utf8).
element('LexicalEntry',[id=w8],[element('Lemma',[writtenForm=wegfahren,partOfSpeech=v],[]),element('Sense',[id='w8_3-v',synset='odenet-3-v'],[])]).
element('LexicalEntry',[id=w4007,confidenceScore='1.0'],[element('Lemma',[writtenForm='Tenor',partOfSpeech=n],[]),element('Sense',[id='w4007_811-n',synset='odenet-811-n',note='PHON:ˈteːnoːɐ̯'],[]),element('Sense',[id='w4007_15099-n',synset='odenet-15099-n',note='PHON:teˈnoːɐ̯'],[]),element('Sense',[id='w4007_34494-n',synset='odenet-34494-n',note='PHON:ˈteːnoːɐ̯'],[])]).
element('LexicalEntry',[id=w67853],[element('Lemma',[writtenForm=π,partOfSpeech=n],[]),element('Sense',[id='w67853_18808-n',synset='odenet-18808-n'],[])]).

‘PHON:ˈteːnoːɐ̯’ is now correct…

On the big XML file (odenet_oneline.xml from the hdaSprachtechnologie/odenet repository on GitHub, at commit 527a326b01eeb47c3180860048aedec6f9c7ae22, if you want to play),
I had to adjust the to_stream clause; otherwise the file contains lots of '\n' clauses:

:- encoding(utf8).

go3:-	
	load_xml('odenet_oneline.xml',Tree,[]),
	% extracting list of entries 	
	[element(_,_,[_,element(_,_,EntryList),_])] = Tree,     
	open('odenet_oneline.pro', write, Stream, []),
	write(Stream,':- encoding(utf8).\n'),
	maplist(to_stream(Stream),EntryList),
	close(Stream).

to_stream(Stream, element(X,Y,Z)):- write_term(Stream, element(X,Y,Z), [fullstop(true), nl(true), quoted(true)]), !.
to_stream(_, _). % skipping newlines, tabs, multiple newlines...

Consulting works:

?- consult('odenet_oneline.pro').
true.
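
For facts that are already asserted (as in the first attempt with go and save), the same idea could be sketched as follows. This is only an assumption-laden variant, not what I ran on the big file: save_elements/1 is a made-up name, the stream is opened with an explicit encoding(utf8), and setup_call_cleanup/3 closes it even if writing fails.

```prolog
:- dynamic element/3.

% Hypothetical replacement for save/0: write all element/3 facts to File
% through a stream that is explicitly opened as UTF-8.
save_elements(File) :-
    setup_call_cleanup(
        open(File, write, Out, [encoding(utf8)]),
        ( format(Out, ":- encoding(utf8).~n", []),
          format(Out, ":- dynamic element/3.~n~n", []),
          forall(element(Name, Attrs, Children),
                 write_term(Out, element(Name, Attrs, Children),
                            [quoted(true), fullstop(true), nl(true)]))
        ),
        close(Out)).
```

After `?- go.`, a call like `?- save_elements('miniodenet.pro').` should then produce a file that can be consulted again.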

Thank you very much!!

Boris · March 14, 2021, 8:09am

I finally got to try this on my (work) Mac. First, I still cannot reproduce the problem that you @j4n_bur53 have, or the one that @tobi has. I took the XML as it is, the code as it is, ran it and this is what I see when I cat the generated file:

% cat miniodenet.pro 
:- dynamic element/3.

element('LexicalEntry', [id=w8], [element('Lemma', [writtenForm=wegfahren, partOfSpeech=v], []), element('Sense', [id='w8_3-v', synset='odenet-3-v'], [])]).
element('LexicalEntry', [id=w4007, confidenceScore='1.0'], [element('Lemma', [writtenForm='Tenor', partOfSpeech=n], []), element('Sense', [id='w4007_811-n', synset='odenet-811-n', note='PHON:ˈteːnoːɐ̯'], []), element('Sense', [id='w4007_15099-n', synset='odenet-15099-n', note='PHON:teˈnoːɐ̯'], []), element('Sense', [id='w4007_34494-n', synset='odenet-34494-n', note='PHON:ˈteːnoːɐ̯'], [])]).
element('LexicalEntry', [id=w67853], [element('Lemma', [writtenForm=π, partOfSpeech=n], []), element('Sense', [id='w67853_18808-n', synset='odenet-18808-n'], [])]).

@j4n_bur53 yes, you are correct, on the Mac (without any initialization files) the encoding is set to text, not to unicode:

?- current_prolog_flag(encoding, X).
X = text.

The output of locale is:

% locale
LANG=""
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

This is a relatively new Mac (automatically) updated to the latest Big Sur. I have not configured it in any particular way that I can remember.

Not for me. I do the same, I see:

?- [user].
|:('PHON:ˈteːnoːɐ̯'.)
|: ^D% user://1 compiled 0.01 sec, 1 clauses
true.

?- tell('phon.pl'), listing(p/1), told.
true.

?- ^D
% halt
% cat phon.pl 
p('PHON:ˈteːnoːɐ̯').

So this works for me. However, there is something going on with pasting into the terminal. Do you see it?

When I paste this: p('PHON:ˈteːnoːɐ̯').

into the user module, I see ('PHON:ˈteːnoːɐ̯'.):

?- [user].
|:('PHON:ˈteːnoːɐ̯'.)
|: ^D% user://1 compiled 0.01 sec, 1 clauses
true.

?- listing(p/1).
p('PHON:ˈteːnoːɐ̯').

true.

It still works as intended apparently.

Do you also see one space missing from the prompt??
