Parsing text using a formal grammar

Parsing text is a common requirement for many applications. Prolog has traditionally supported this by mapping text to a list and using DCG’s. Alternatively, SWI-Prolog provides a regular expression library (“pcre”) for parsing text that can be defined using regular grammars (in Chomsky terminology). But regular expressions can’t express context free and context sensitive grammars that are common in many machine oriented languages.

Formal grammars supporting these more powerful languages have been around for decades but haven’t gained the same traction in the programming community as regular expressions. They are mainly used in compiler front ends and standards documents. In the former case the “tools” typically produce source code in some target language which gets compiled into the application. In the latter case, they usually function as specifications, not intended to be directly executed.

pPEG is a dialect of PEGs (Parsing Expression Grammars). PEGs are well suited for recognizing machine oriented languages that are non-ambiguous (i.e., deterministic) with linear performance characteristics (O(n) where n is the length of the input text). pPEG is used much like regular expressions: the grammar is defined in a quasi-quotation with syntax pPEG which gets “compiled” to a grammar term. This grammar then used in parsing input text in the specified language to and produce a generic tree structure, called a ptree. A simple example for csv (comma separated values, RFC 4180):

csv_grammar({|pPEG||
	CSV     = Hdr Row+
	Hdr     = Row
	Row     = field (',' field)* '\r'? '\n'
	field   = _string / _text / ''

	_text   = ~[,\n\r]+
	_string = '"' (~["] / '""')* '"'
|}).

sample_csv({|string||
Details,Month,Amount
Mid Bonus,June,"2,000"
,January,"""zippo"""
Total Bonuses,"","5,000"
|}).

/*
?- csv_grammar(G), sample_csv(Data), peg_parse(G, Data, CSVTree).

CSVTree = 'CSV'([
	'Hdr'(['Row'([field("Details"),field("Month"),field("Amount")])]),
	'Row'([field("Mid Bonus"),field("June"),field("\"2,000\"")]),
	'Row'([field(""),field("January"),field("\"\"\"zippo\"\"\"")]),
	'Row'([field("Total Bonuses"),field("\"\""),field("\"5,000\"")])
]).
*/

Alternatively, the grammar could be specified in a quasi-quoted string, or a Prolog double-quoted string, and compiled explicitly.

pPEG is designed for portability so, like regular expressions, there is no support for semantic actions or predicates in the grammars that could jeopardize that objective. The generic ptree parse result is a structure that only requires arrays (or lists) and strings which are supported by any general purpose programming language (although the details may vary between implementations). Any subsequent semantic analysis and processing uses the (language specific) ptree format as input; walking the ptree is a pretty straight forward process in symbolic languages like Prolog. (An online “dingus” based on the Javascript implementation can be found at < https://pcanz.github.io/pPEGjs/dingus.html >.)

pPEG for SWI-Prolog is available as an add-on pack (SWI-Prolog packages or directly at GitHub - ridgeworks/pPEGpl: pPEG grammar support for SWI-Prolog). It’s implemented entirely in Prolog so the raw performance is unlikely to match what a custom parser can achieve, since the software consists of layers of virtual machines (pPEG on SWI-Prolog), but it’s not terrible (a few microseconds per character depending on grammar and input text). In many cases where text parsing isn’t a performance critical component in the end-to-end processing, it may well be good enough. And even when a custom parser is warranted, starting with a formal, and executable, grammar specification is usually a good idea. Additional documentation and examples can be found in the github repo.

Even if you don’t require text parsing support, you may be interested in a formal grammar for SWI-Prolog syntax which can be found in the pPEGpl repo at https://github.com/ridgeworks/pPEGpl/tree/main/Examples/SWIP-grammar. I’m not sure it covers all the possible corner cases of the builtin SWIP parser, but it does recognize/parse all the top level files in the SWIP release boot and library directories given the necessary set of global operator definitions. (It’s not a compiler.) I think a formal syntax specification is a useful addition which complements the existing reference material.

5 Likes