Cool to see someone picking up Prolog after a uni course instead of hating it.
I’ve been prototyping parsing and generation of grammatical forms using Latin
for a couple of years now, so there are a few things I’ve learned:
- If you want the same predicate to parse and generate, you should generally rely on unification as much as you can, as opposed to predicates like string_concat, which means DCGs are preferred.
Getting a stem from a noun like this:

    stem_noun(Stem, Noun, singular) :-
        string_concat(Stem, "o", Noun).

can be similarly expressed as
    :- use_module(library(dcg/basics)).  % for string//1

    noun_ending(singular) --> "o".
    noun_ending(plural) --> "oj".

    % noun(?Stem, ?FormDescription)
    noun(Stem, noun(Number)) -->
        string(Stem), noun_ending(Number).
which is both a parser and a generator. This changes the representation of a noun from a string to a character-code list, but offers a couple of advantages: DCGs compose nicely, like string and noun_ending in the example; it’s two-way with less hassle; and this is how files are read by default in Prolog (I think).
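To make the two-way behaviour concrete, here is roughly what the queries look like (a sketch assuming SWI-Prolog with library(dcg/basics); answers are code lists, shown here in backquote notation for readability):

    ?- phrase(noun(Stem, Desc), `knaboj`).   % parse
    Stem = `knab`, Desc = noun(plural).

    ?- phrase(noun(`knab`, noun(singular)), Codes).   % generate
    Codes = `knabo`.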
- As Eric pointed out, constraints help a lot to keep your code two-way.
In fact, I think they’re a good way to add dictionary support too.
Suppose you have a noun predicate like the one above.
You can inject a dictionary check like this:
    stem_translation(`knab`, "boy").

    translate_noun(Noun, Translation) :-
        when(
            (nonvar(Stem) ; nonvar(Translation)),
            stem_translation(Stem, Translation)),
        phrase(noun(Stem, noun(_)), Noun).
This runs the stem_translation goal once either Stem or Translation becomes known, meaning it can both parse and generate, depending on the instantiation of these variables. In fact, you can even run translate_noun(Noun, Translation) with both arguments unbound, and you will get all forms for all nouns in the dictionary.
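For illustration, the three call modes would look something like this (again assuming SWI-Prolog; answers shown in backquote notation as before):

    ?- translate_noun(`knabo`, T).   % parse:    T = "boy"
    ?- translate_noun(N, "boy").     % generate: N = `knabo` ; N = `knaboj`
    ?- translate_noun(N, T).         % enumerate every form of every entry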
In principle, you should consider using coroutines wherever you see a need for checks like var/1, ground/1 etc. These checks only see the instantiation state at the moment they run, so they tend to get repeated; coroutines avoid this repetition and let you state constraints that get checked even within code that you do not control, like library calls.
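A minimal sketch of the difference (len_if_ready and len_when_ready are hypothetical helpers, not library predicates):

    % One-shot check: inspects L exactly once, at call time.
    len_if_ready(L, N) :-
        ( ground(L) -> length(L, N) ; true ).

    % Coroutine: suspends, then fires the moment L becomes ground,
    % even if the binding happens deep inside a library call.
    len_when_ready(L, N) :-
        when(ground(L), length(L, N)).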
- This may be more arbitrary, but you can try to shift complexity from the call structure onto the data structure. What I mean by that: for now you have a noun predicate. Let’s say you want to parse verbs as well and have a predicate word that works for both nouns and verbs. You can add a verb predicate and have the word predicate invoke both specific predicates as alternatives, roughly as sketched below.
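(The verb rule here is a hypothetical stand-in, analogous to noun above:)

    verb(Stem, verb(present)) --> string(Stem), "as".

    word(Stem, Description) --> noun(Stem, Description).
    word(Stem, Description) --> verb(Stem, Description).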
This, however, has some major drawbacks:
a) backtracking can get expensive as the logic grows, and
b) part of the meaning of theoretically simple predicates is buried in the call stack:
in the case of noun_ending(singular) --> "o", the knowledge that a word ending in -o might be a noun is conveyed only by the fact that this predicate happens to be called somewhere inside noun.
Compare this with a different approach:
    ending(noun(singular)) --> "o".
    ending(noun(plural)) --> "oj".
    ending(verb(present)) --> "as".

    word(Stem, FormDescription) --> string(Stem), ending(FormDescription).
Here we make the call structure simpler, but the data we pass to predicates is richer.
Everything we can learn about a word, including its part of speech, is present locally in the predicate itself. This means the code can grow considerably, covering more grammatical categories than just number or tense, some of which will need to be left unbound, e.g. in Latin:

    ending(noun(plural, genitive, _Gender)) --> "um".

The ending -um tells us nothing about the noun’s gender, but we still need to include it in the pattern. In the end this is a verbosity-complexity-refactorability tradeoff.
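With that in place, one generic query handles both parts of speech (a sketch; kantas is Esperanto for “sings”):

    ?- phrase(word(Stem, Desc), `knaboj`).
    Stem = `knab`, Desc = noun(plural).

    ?- phrase(word(Stem, Desc), `kantas`).
    Stem = `kant`, Desc = verb(present).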
- Avoid cuts and other impure constructs if you can; they’re evil, and they tend to destroy exactly the two-way behaviour you’re after.
Whew, this post grew, but I think these four points cover most of what I found most useful.
Oh, and
- I focused on parsing single word forms because that’s what I’m most familiar with,
but for whole sentences keep in mind that DCGs may not be the best option
if you intend to parse sentence structure and allow for non-strict word order.