Most efficient DCG for text parsing?

A good option is to use term_expansion/2 to generate the 127 rules for you, since typing 127 rules by hand is a bit boring. From the implementation point of view, though, the 127 rules are most likely the best solution.
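As a rough sketch of what that generation could look like: the snippet below assumes SWI-Prolog and assumes the 127 rules are one clause per character code 1..127. Both the marker term gen_code_rules and the nonterminal ascii_code//1 are hypothetical names made up for illustration, not something from the original question.

```prolog
% Sketch only: gen_code_rules and ascii_code//1 are hypothetical names,
% and the range 1..127 is an assumption about what "the 127 rules" are.
:- multifile user:term_expansion/2.
:- dynamic   user:term_expansion/2.

% When the loader reads the marker term gen_code_rules, replace it by
% 127 clauses. Each clause is already in translated DCG form, i.e. it
% is what  ascii_code(Code) --> [Code].  would compile to.
user:term_expansion(gen_code_rules, Clauses) :-
    findall(ascii_code(Code, [Code|Rest], Rest),
            between(1, 127, Code),
            Clauses).

% The marker itself: expanded at load time, never stored as a fact.
gen_code_rules.
```

After loading, phrase(ascii_code(C), [0'a]) binds C to 97. In a real grammar each generated clause would carry whatever per-code information the hand-written rules were meant to hold.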

Opinions vary. Roughly, you have three options:

  • Use var/1 (nonvar/1) checks in the rules.
  • Use delays (when/2, freeze/2).
  • Write two DCGs.

All three have their merits. The advantage of the first two is that they keep the code for parsing and generation together, which makes it easier to stay consistent if you change the rules over time. That is particularly true for the delay version.

Delays are relatively slow though, and you have to be careful if you also want to use cuts (or if -> then ; else): you must make sure all relevant delayed calls have materialized before you commit. Committing is more or less obligatory in real-world DCGs, in particular for artificial languages, as not doing so typically leads to practically infinite backtracking in case of a syntax error. You also need to be sure not to leave any residual goals behind. An explicit var/nonvar split is typically more efficient, but uglier.

Two distinct DCGs avoid all the var/nonvar tests and delays, but make it harder to keep the two in sync. On the other hand, the two DCGs typically differ anyway when it comes to comments, layout handling and other input ambiguities (escaped strings, floating point numbers, etc.). The parsing one often simply skips all that, while the generating one may wish to keep track of nesting to emit line breaks and indentation. Trying to combine all of that in one implementation is not always a good idea.
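To make the first two options concrete, here is a minimal sketch for a single nonterminal that both parses and emits a natural number. It assumes SWI-Prolog with library(dcg/basics) for digits//1; the names nat//1, nat_delayed//1 and emit//1 are hypothetical and only serve the illustration.

```prolog
:- use_module(library(dcg/basics)).   % provides digits//1

% Option 1: explicit var/nonvar split.
nat(N) -->
    { var(N) }, !,                    % parsing: N is still unknown
    digits([D|Ds]),                   % require at least one digit
    { number_codes(N, [D|Ds]) }.
nat(N) -->
    { integer(N), N >= 0,             % generating: N is known
      number_codes(N, Codes) },
    emit(Codes).

% Helper: emit a list of codes one by one.
emit([])     --> [].
emit([C|Cs]) --> [C], emit(Cs).

% Option 2: a single rule with a delayed conversion. number_codes/2
% runs as soon as either side is known: immediately when generating,
% and only once all digits have been read when parsing.
nat_delayed(N) -->
    { Codes = [_|_],
      when(( nonvar(N) ; ground(Codes) ), number_codes(N, Codes)) },
    digits(Codes).
```

Both phrase(nat(N), [0'4,0'2]) and phrase(nat_delayed(N), [0'4,0'2]) give N = 42, and calling either with N bound to 42 produces the codes of "42". Note that in the delayed version N is not bound until digits//1 has finished, so a cut or a test on N placed earlier in the rule would act before the conversion has happened, which is the kind of care mentioned above.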

… it depends on the use case, requirements and preferences …
