Wiki Discussion: DCG and phrase/3

Starting to think that one should avoid closed lists when parsing with DCGs and instead use difference lists, i.e. accumulator passing. When the accumulator needs to be converted to a value, have a predicate that converts the codes to a value, e.g. a string or a number, and have that predicate wrap the value in a structure, e.g. port(Number) or url(String), when returning it.
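
Something like this minimal sketch, assuming SWI-Prolog; digit_codes//2 and port//1 are hypothetical names, not from any library:

% digit_codes//2 accumulates digit codes as a difference list.
% The recursive clause comes first so the longest match is tried first.
digit_codes([C|T0], T) -->
    [C],
    { code_type(C, digit) },
    digit_codes(T0, T).
digit_codes(T, T) --> [].

% port//1 closes the difference list, converts the codes to a number,
% and wraps the value in a structure before returning it.
port(port(Number)) -->
    digit_codes(Codes, []),
    { Codes = [_|_],                % require at least one digit
      number_codes(Number, Codes) }.

So phrase(port(P), `8080`) binds P to port(8080).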

EDIT

More rules of thumb that are starting to make sense when converting BNF to DCGs for parsing.

  1. Create one module (base module) that contains the BNF converted to DCGs, using difference lists for accumulator passing.

  2. Use a new module as a library for each other piece of code that builds on top of the base module. There can be more than one need for the base module, and those needs can conflict, e.g. reproducing the input exactly as output, converting the input into a parse tree, or converting the input into an abstract syntax tree. In other words, don’t put all of the code into one file, as that could result in predicates with the same predicate indicator but different uses, which obviously causes many unnecessary clashes. (A module sketch follows the code examples below.)

  3. Don’t do any premature optimizations. E.g. something as simple as the following could be a premature optimization that conflicts with a different need later.

without optimization

proseval(T0,T) -->
    "<",
    { T0 = [0'<|T1] },      % thread the consumed < through the accumulator
    proseval_text(T1,T2),
    ">",
    { T2 = [0'>|T] }.       % likewise for the consumed >

with premature optimization

proseval([0'<|T0],T) -->
    "<",
    proseval_text(T0,[0'>|T]),
    ">".

In other words, if you write DCGs for parsing based on BNF and those same BNF rules are used for multiple purposes, then premature optimizations may have to be undone for a different use before the code can be reused, so just avoid the optimizations in the base module. Also, as I like to note often: get the code working correctly first, then go back and do optimizations if needed. "If needed" means that your code is so slow you can take a coffee break, not that you optimize because you can.
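
As one possible illustration of rules 1 and 2, here is a sketch of the module layout in SWI-Prolog; the file, module, and predicate names are all hypothetical:

% File: bnf_base.pl -- only the BNF translated to DCGs, unoptimized.
:- module(bnf_base, [digits//2]).

% digits//2 threads the matched codes as a difference list.
digits([C|T0], T) --> [C], { code_type(C, digit) }, digits(T0, T).
digits(T, T) --> [].

% File: echo_client.pl -- one need: reproduce the input exactly.
:- module(echo_client, [echo/2]).
:- use_module(bnf_base).

echo(Codes, Codes) :- phrase(digits(Codes, []), Codes).

% File: ast_client.pl -- a conflicting need: an abstract value.
:- module(ast_client, [to_integer/2]).
:- use_module(bnf_base).

to_integer(Codes, integer(N)) :-
    phrase(digits(Codes, []), Codes),
    number_codes(N, Codes).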


EDIT

Another concept that I have known about but need to relate in the Wiki is a Chinese Wall when creating the structure that will be used, e.g. a parse tree or AST. How much information you need determines how the predicates that convert the codes to structures are created. In other words, you will need more than one set of instructions crossing over the Chinese Wall. Not knowing that there can be more than one Chinese Wall document is a major hurdle that many don’t see past when trying to learn about parsing. It could also be one document with sections that have conflicting results. People don’t see that sometimes one document will work and at other times many are needed. When many are needed, I am finding that modules come in handy.
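
As a tiny hypothetical illustration, here are two conversion predicates for the same digit codes, one per side of the wall; the names are made up:

% Parse-tree consumer: keep the raw codes, so formatting such as
% leading zeros survives ("00001" stays "00001").
codes_to_parse_tree(Codes, number(Codes)).

% AST consumer: only the value crosses the wall ("00001" becomes 1).
codes_to_ast(Codes, integer(N)) :-
    number_codes(N, Codes).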


EDIT

Need to consider a section on Mixfix operators (ref)


EDIT

Starting to think there are at least three different types of fixed point checks (minimal sketches follow the list).

  1. Character codes - Useful as a recognizer.
  2. Format preserving - Useful when all the whitespace and such needs to be preserved. I can’t recall this one ever being used with programming languages; even in languages such as Python, where indentation means something, this level of parsing is too much. It has a use, but I very rarely need it. Another example where it would be needed is a number with leading zeros that need to be preserved, e.g. 00001.
  3. Canonical - Unless preserving formatting is needed, this is the one to aim for.
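
Minimal sketches of the three checks, assuming two hypothetical nonterminals: g_codes//2, which threads the raw codes as a difference list, and g_tree//1, which relates a structure to codes in both directions:

% 1. Character codes: the accumulated codes equal the input.
recognizer_fixed_point(Codes) :-
    phrase(g_codes(Codes, []), Codes).

% 2. Format preserving: parse, regenerate, compare the codes.
format_fixed_point(Codes) :-
    phrase(g_tree(Tree), Codes),
    phrase(g_tree(Tree), Codes2),
    Codes == Codes2.

% 3. Canonical: the regenerated codes reparse to the same structure.
canonical_fixed_point(Codes) :-
    phrase(g_tree(Tree0), Codes),
    phrase(g_tree(Tree0), Codes1),
    phrase(g_tree(Tree1), Codes1),
    Tree1 == Tree0.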

EDIT

Useful info in this topic: Phrase_from_file vs phrase