How to create a DCG-predicate that is usable as both a parser and a generator?

Hi everyone!

I recently wanted to become better at Prolog (having used it in the past only in an “Introduction to Logic”-course at university). As I am also learning a bit to speak Esperanto, which is a constructed language and therefore reasonably regular/systematic I thought it would be an interesting challenge to attempt to create a parser/generator of simple Esperanto sentences with Prolog.

For instance, in Esperanto all (singular) nouns end with -o. It is thus easy to write a predicate like

stem_noun(Stem, Noun, singular) :- 
  string_concat(Stem, "o", Noun).

noun(noun(Stem, Number) -->
  [Noun], {
    stem_noun(Stem, Noun, Number)

This works fine for parsing, when used like phrase(noun(Res), ["knabo"]).. It also works fine for testing when all arguments are filled in.

It however does not work for generating, when used like phrase(noun(_), Res). because string_concat will complain that Stem is not instantiated enough.
Arguably this is reasonable; after all when generating words it might make sense to only stick to a vocabulary of known words.

Thus I might rewrite stem_noun as:

stem_translation("knab", "boy").
stem_translation("flor", "flower").
% etc
% ...

stem_noun(Stem, Noun, singular) :-
  stem_translation(Stem, _),
  string_concat(Stem, "o", Noun).

However, this will restrict the user to only using words for which the stem is known. As (a) it is infeasible for the vocabulary to be complete as new words get made from time to time and (b) names of for instance persons are also used as nouns, this would be too restrictive.

So how is this managed? One approach I considered is

stem_noun(Stem, Noun, singular) :-
  (nonvar(Noun) ; stem_translation(Stem, _)),
  string_concat(Stem, "o", Noun).

which will make the flow different depending on the direction the predicate is used. I believe it “works as intended”, but I’m not sure how idiomatic this is. Are there better ways to approach this problem?


Some simple advise.

  1. When first learning to use DCGs for both parsing and generating, I have found that it makes more sense to develop them independently until you have them both working. Then if there is enough commonality in the code they can be combined. Sometimes it works seamlessly, sometimes it appears to work seamlessly when using the code but the code is not easy to understand (e.g. lots of conditionals), sometimes it turns into a nightmare and can’t be done seamlessly. Using constraints also helps at times.

  2. Since you are using a spoken language and not a programming language for your exercise you will need to look at work using Natural Language Processing. Luckily there is a nice free intro book.

“Prolog and Natural-Language Analysis” by Fernando C. N. Pereira and Stuart M. Shieber (site) (free pdf )



Cool to see someone picking up Prolog after uni course instead of hating it.
I’ve been prototyping parsing/generating of grammatical forms using Latin
for a couple of years now, so there’s a couple of things I’ve learned:

  1. If you want the same predicate to parse and generate, you should generally rely on unification as much as you can, as opposed to predicates like string_concat, which means DCGs are preferred.

Getting a stem from a noun like this:

stem_noun(Stem, Noun, singular) :-
    string_concat(Stem, "o", Noun).

can be similarily expressed as

noun_ending(singular) --> "o".
noun_ending(plural) --> "oj".
% noun(-Stem, -FormDescription)
noun(Stem, noun(Number)) -->
    string(Stem), noun_ending(Number).

which is both a parser and a generator. This changes the representation of a noun from a string to character-code list, but offers a couple of advantages: DCGs compose nicely like
string and nound_ending in the example; it’s two-way with less hassle; this is how
files are read by default in Prolog (I think).

  1. As Eric pointed out, constraints help a lot to keep your code two-way.
    In fact, I think they’re a good way to add a dictionary support too.
    Suppose you have a noun predicate like the one above.
    You can inject a dictionary check like this:
stem_translation(`knab`, "boy").

translate_noun(Noun, Translation) :-
        (nonvar(Stem) ; nonvar(Noun)),
        stem_translation(Stem, Translation)),
    phrase(noun(Stem, noun(_)), Noun).

This runs the stem_translation goal once either Stem or Translation is known,
meaning it can both parse and generate, depending on grounded-ness of these variables.
In fact, you can even run translate_noun(Noun, Translation) and
you will get all forms for all nouns in the dictionary.

In principle, you should consider using coroutines wherever you see a need for checks like
var/1, grounded/1 etc. These checks are performed once, so they often get repeated;
coroutines avoid this repetition and allow you to put constraints that get checked
even within code that you do not control, like library calls.

  1. This may be more arbitrary, but you can try and shift complexity from call structure
    onto the data structure. What I mean by that: for now you have a noun predicate.
    Let’s say you want to parse verbs as well and have a predicate word that works for both
    nouns and verbs. You can add a verb predicate and have word predicate invoke
    both specific predicates as alternatives. This, however, has some major drawbacks:
    a) backtracking can be expensive when logic grows and
    b) part of the meaning of theoretically simple predicates is buried in the call stack:
    in case of noun_ending(singular) --> "o" knowledge that the word ending in -o might be
    a noun is logically conveyed only by the fact that this predicate is called somewhere by noun.

Compare this with a different approach:

ending(noun(singular)) --> "o".
ending(noun(plural)) --> "oj".
ending(verb(present)) --> "as".

word(Stem, FormDescription) --> string(Stem), ending(FormDescription).

Here we make the call structure simpler, but the data we pass to predicates is richer.
Everything we can learn about a word, including its part of speech is present locally in the
predicate itself. This means the code can grow considerably with more grammatical categories
than just number or tense, some of which will need to be left unbound, eg. in latin:
ending(noun(plural, genetive, _Gender)) --> "um". the ending -um tells us nothing about
the noun’s gender but we still need to include it in the pattern.
In the end this is a verbosity-complexity-refactorability tradeoff.

  1. Avoid cuts and other impure constructs if you can, they’re evil.

Whew, this post grew, but I think these four points cover most of I found most useful.

Oh and

  1. I focused on parsing single word forms because that’s what I’m most familiar with,
    but for whole sentences keep in mind DCGs may not be the best option
    if you intend to parse sentence structure allow for non-strict word order.

It may not have been a good idea, but for my HTTP/2 client library, I wrote the DCGs for frames to both generate and parse; see here. As @AmbientTea mentioned, it uses coroutining plus the “reifed if” predicate if_ from library(reif) quite a bit.


Thank you very much, @AmbientTea! This is exactly the kind of expert techniques that are missing from all the ‘start using Prolog’ guides and books that I’ve been able to find so far. Very cool! :green_heart:

I’ll let you know what happens once I have the time to try the techniques out on my code.

@jamesnvc thank you very much for that example as well!

1 Like