Using DCGs for building a serialiser

volker-wysk · May 6, 2021, 1:22pm

Hi!

I want to build (for now) a pretty printer using DCGs. Later I want to build a much bigger serialiser with DCGs. A (recursive) data structure should be the input, and a list of characters (or atoms) the output. I also need to transport a state, which contains the current column and indentation.

They say that DCGs can be used for this purpose (apart from building parsers). I’d like
to know how this is done. I’ve experimented a bit, but the result looks more like botch-up
than a clean solution to me.

For instance, I get the result in backwards order. If I used “append” I’d get square
complexity. Are the implicit last two arguments a difference list? Should they be input and output or output and input? And so on.

I couldn’t find a textbook or a paper which treats this topic.

So far,
Volker

EricGT · May 6, 2021, 1:27pm

If the data is simple enough then post the BNF and I will give a first rough copy of converting it to a DCG. Please don’t expect it to be pure in the sense that the code can both parse and generate.

Also I suspect that because you are passing the current column and indentation that more accumulators will be needed or the accumulator will need to be a compound term instead of just a single term.

volker-wysk · May 6, 2021, 2:35pm

I’ve shortened the grammar. I’m not sure how BNF looks when there are arguments to the rules. When I was in university and learned about BNF (a long time ago), there weren’t any. But I think this should be intelligible:

paragraph
   ---> heading(string)
   |    text(list(inline))

inline
   ---> word(string)
   |    macro(string)
   |    space

word(Str) 
   ---> [Str]

macro(Str)
   ---> ['<<', Str, '>>']

heading(Str)
   ---> nl, ['== ', Str, ' =='], nl

text([]) 
   ---> []

text([I|Is])
   ---> inline(I), text(Is)

space 
   ---> [' ']

nl
   ---> ['\n']

Yes, I know.

The deal with the additional state (column and indentation) is, when the word, macro or space rules reach the max text width (with the column), then a newline should be generated, and it should be indented as specified with the indentation. Therefore there would be two more rules, called “indent” and “unindent”, which manipulate the indentation. And all rules (with terminals) need to update the column.

Cheers,
Volker

EricGT · May 6, 2021, 2:38pm

Thanks,

After a quick glance it looks this can be done. Now to try it!

EricGT · May 6, 2021, 2:54pm

In your BNF what does [ ... ] represent?
Is it one or more, zero or more, or something else?

If I look at EBNF then [ ... ] means optional. Now I am really confused.

EDIT

Can you give some sample input and the expected output.

This is what I am thinking can be valid input.

== Heading Line 1 ==
== Heading Line 2 ==

<<macro 1>>
<<
 macro 2
>>

regular line 1
regular line 1

and the parsed result could be a compound term representing an AST.

paragraph(
  [heading("Heading Line 1"),heading("Heading Line 2")],  
  [macro("macro 1"),macro("macro 2")],
  [text("regular line 1"),text("regular line 2")]
).

volker-wysk · May 6, 2021, 3:11pm

I’ve taken it from the DCG syntax. It means “literal”.
In the case of the “text(Str)” rule, it means “list” (empty list “[]” / list with “[Head|Tail]”).

…

You’re faster than I can answer.

Here’s some input:

paragraph(
  [heading("Heading Line 1"), 
   heading("Heading Line 2"),  
   macro("macro 1"), 
   macro("macro 2"),
   text([word("regular"), space, word("line"), space, word("1")])
).

volker-wysk · May 6, 2021, 3:13pm

The expected output:

== Heading Line 1 ==
== Heading Line 2 ==
<<macro 1>>
<<macro 2>>
regular line 1

EricGT · May 6, 2021, 3:14pm

Thanks.

I was looking at it backwards expecting text to a compound term, but this is a compound term to text.

volker-wysk · May 6, 2021, 3:20pm

Oops. That’s not correct. Macros are “inlines”. The input should be:

paragraph(
  [heading("Heading Line 1"), 
   heading("Heading Line 2"),  
   text([macro("macro 1"), macro("macro 2"), word("regular"), space, 
           word("line"), space, word("1")])
).

And the output:

== Heading Line 1 ==
== Heading Line 2 ==
<<macro 1>><<macro 2>>regular line 1

EricGT · May 6, 2021, 3:26pm

volker-wysk:

paragraph(
  [heading("Heading Line 1"), 
   heading("Heading Line 2"),  
   text([macro("macro 1"), macro("macro 2"), word("regular"), space, 
           word("line"), space, word("1")])
).

Is the [ in [heading("Heading Line 1"), a typo? Or is there suppose to be a matching one that is missing?

volker-wysk · May 6, 2021, 3:32pm

Oops again. I’ve shortened the grammar too much. Sorry.

Here it is again:

page ---> 
   list(paragraph)

paragraph
   ---> heading(string)
   |      text(list(inline))

inline
   ---> word(string)
   |    macro(string)
   |    space

word(Str) 
   ---> [Str]

macro(Str)
   ---> ['<<', Str, '>>']

heading(Str)
   ---> nl, ['== ', Str, ' =='], nl

text([]) 
   ---> []

text([I|Is])
   ---> inline(I), text(Is)

space 
   ---> [' ']

The input is a page:

  [heading("Heading Line 1"), 
   heading("Heading Line 2"),  
   text([macro("macro 1"), macro("macro 2"), word("regular"), space, 
           word("line"), space, word("1")])
  ]

The output is like I’ve written.

EricGT · May 6, 2021, 3:39pm

Since this topic is mainly between just two of us for now, it has moved to a private message. If you want to be included just ask and we can invite you.

Boris · May 7, 2021, 1:57am

(I hope I am not writing stuff you already know, the topic is now private, and I am not sure if you showed your initial attempt)

With phrase/2, you have the list as the second argument. So if you want to have something serialized, the easier way is to do:

phrase(my_dcg(Term), Serialized)

You should also take advantage of the DCG libraries, library(dcg/basics) and library(dcg/high_order).

This is one way to write it:

:- use_module(library(dcg/basics)).
:- use_module(library(dcg/high_order)).

heading(H) --> "== ", atom(H), " ==".

macro(M) --> "<<", atom(M), ">>".

word(W) --> atom(W).

space --> " ".

text(T) --> sequence(inline, T).

inline(X) --> call(X).

page(P) --> sequence(line, P).

line(X) --> call(X), "\n".

(Note, I wouldn’t normally make one-liners like this but this is to keep it shorter on the page here)

And this is how you would use it, from your posts:

volker-wysk:

The input is a page:

  [heading("Heading Line 1"), 
   heading("Heading Line 2"),  
   text([macro("macro 1"), macro("macro 2"), word("regular"), space, 
           word("line"), space, word("1")])
  ]

Using the grammar I showed above (don’t forget the two libraries!):

?- phrase(
       page([heading("Heading Line 1"), 
             heading("Heading Line 2"),  
             text([macro("macro 1"), macro("macro 2"), word("regular"), space, 
               word("line"), space, word("1")])
            ]),
       Serialized),
   format("~s", [Serialized]).
== Heading Line 1 ==
== Heading Line 2 ==
<<macro 1>><<macro 2>>regular line 1
Serialized = [61, 61, 32, 72, 101, 97, 100, 105, 110|...].

I used format/2 to show what the list of codes in Serialized would look like if you printed it.

Does that help at all? I can explain some of the finer points if you think this is roughly what you are after. Keeping state is another thing, let’s keep if simple for now.

EricGT · May 7, 2021, 8:16am

The only reason we took the messages private was so that the other 99% of the users did not have to be notified every time we created a new reply that was only of value to figuring out the details.

As I noted, if you would like to be included in those messages we can just drop @Boris in one of the replies and you will start seeing the private topic and all of the replies.

You are more than welcome to be included.

Boris · May 7, 2021, 10:04am

My only point was, I don’t know if what I wrote was still relevant.

The question itself is worth having a public discussion. I definitely remember struggling a bit at some point with this one thing in particular:

If you are using a DCG to turn a (nested) term into text, then you need to give the term to your DCG, and the generated text is the list in the second argument to phrase/2. As I wrote above:

phrase(my_dcg(Term), Serialized)

Also, from what was still visible, I was left with the impression that the problem that @volker-wysk was facing was clear enough for a complete solution. I already showed one possible complete solution in my post above.

EricGT · May 7, 2021, 10:15am

Yes, it is relevant.

It is not the same as what I wrote or done in the same manner but that is not a negative it is a positive in that it can help others learn other ways to solve the problem. The way you do DCGs is different than the way I do DCGs and many people understand and prefer your way over the way I do them.

Actually he has a thousand lines of Mercury code and it is more complex than I thought. However as we know with DCGs and parsing that there are many redundant patterns such as the regular expression concepts.

I am going to invite you to the private message so you can see more, and please jump in with your ideas, I am sure @volker-wysk would like to have more choices.

You should have received the invite.

Boris · May 7, 2021, 10:28am

Thank you Eric. I am ready to help with generic advice and small examples, publicly, so that others who struggle with the same problem can benefit. I don’t think I have the bandwidth to help with bigger problems at this moment.

volker-wysk · May 7, 2021, 2:32pm

So input becomes output and output becomes input, in comparision to DCGs making a parser. That’s an important point, which wasn’t clear to me.

I’ve taken your example and simplified it, such that no DCG libraries are used, and the two “calls” are eliminated. This is to make it easier to understand what’s going on:

page([]) --> "".
page([P|Ps]) --> paragraph(P), page(Ps).

paragraph(heading(Str)) --> heading(Str), "\n".
paragraph(text(Inlines)) --> text(Inlines), "\n".

heading(H) --> "== ", H, " ==".

macro(M) --> "<<", M, ">>".

word(W) --> W.

space --> " ".

text([]) --> "".
text([I|Is]) --> inline(I), text(Is).

inline(word(Str)) --> word(Str).
inline(macro(Str)) --> macro(Str).
inline(space) --> space.

I actually don’t do my serialiser in Prolog, but in Mercury. I don’t have the two SWI-Prolog DCG libraries there. The “call” predicate also isn’t available in Mercury (higher order calls are done differently). In the Mercury language reference manual, in the section about DCGs, it’s written that syntax and semantics are identical to Prolog. Therefore I asked in the SWI-Prolog list. But now, there are differences.

I have been able to transform the simplified version of your example to Mercury. For the curious, here is the Mercury code. It’s a direct translation. But I’m producing a list of strings, instead of a list of characters:

:- type page == list(paragraph).

:- type paragraph
   ---> heading(string)
   ;    text(list(inline)).

:- type inline
   ---> word(string)
   ;    macro(string)
   ;    space.


:- pred page(list(paragraph)::in, list(string)::out, list(string)::in) is det.
page([]) --> [].
page([P|Ps]) --> paragraph(P), page(Ps).

:- pred paragraph(paragraph::in, list(string)::out, list(string)::in) is det.
paragraph(heading(Str)) --> heading(Str), ["\n"].
paragraph(text(Inlines)) --> text(Inlines), ["\n"].

:- pred heading(string::in, list(string)::out, list(string)::in) is det.
heading(H) --> [ "== ", H, " ==" ].

:- pred macro(string::in, list(string)::out, list(string)::in) is det.
macro(M) --> ["<<", M, ">>"].

:- pred word(string::in, list(string)::out, list(string)::in) is det.
word(W) --> [ W ].

:- pred space(list(string)::out, list(string)::in) is det.
space --> [" "].

:- pred text(list(inline)::in, list(string)::out, list(string)::in) is det.
text([]) --> [].
text([I|Is]) --> inline(I), text(Is).

:- pred inline(inline::in, list(string)::out, list(string)::in) is det.
inline(word(Str)) --> word(Str).
inline(macro(Str)) --> macro(Str).
inline(space) --> space.

Now I’ll try to add that additional state (indentation and column) to the grammar.

Cheers,
Volker

peter.ludemann · May 9, 2021, 12:36am

If you want another example of a reversible DCG, see protobufs.pl, either protobuf_segment_message/2 or protobuf_message/2 … many of the rules can be used reversibly as-is; but some require reordering the goals based on whether an argument satisfies var/1, nonvar/1, ground/1 (often, these involve arithmetic … it’s possible that the reordering could be avoided by using when/2, but I haven’t looked deeply into that.

volker-wysk · May 9, 2021, 11:48am

That’s a misunderstandig. I don’t need a reversible DCG. I’ts just that the second argument of phrase/2 and phrase/3 is an output argument when doing a serialiser, and is an input argument for a parser. For phrase/3, the third argument is an input argument for a serialiser and an output argument for a parser. It’s the same with the two implicit arguments of the predicates that have been defined with DCG syntax, which aren’t called directly in Prolog (but in Mercury).

Bye

Topic		Replies	Views
Is there a better way to convert a compound term to a string than this? Help!	4	1384	May 18, 2019
What's the idiomatic way of developing DCGs? Help!	8	640	December 17, 2020
Calling call//1 makes DCG nonfunctional? (+ "difference list" in DCGs vs. "difference list" for appending) Help! discussion , how-to	20	2501	July 26, 2020
Example parse of binary data using DCG (Discussion) Discussion	0	256	October 10, 2019
Entry level Question: DCGs over structured alphabets Help! how-to	2	140	February 13, 2024

Using DCGs for building a serialiser

Related topics