DCG to read Prolog terms

mike.elston · August 13, 2019, 8:14am

Hi

I want to write a DCG that parses a mixture of text and Prolog terms e.g.

2019-08-13 123456 my_term(foo)

Library dcg/basics can handle the first two tokens:

string_without(` `, Date), ` `, string_without(` `, ID), ` `,

but how can I efficiently consume characters to read the term my_term(foo) ?

I could consume increasingly long sequences of characters and test them with read_term_from_atom/3 but that doesn’t seem very efficient. I need something like read_term_from_codes(?Term, Codes, Tail).

Thanks

Mike

Boris · August 14, 2019, 6:29am

Hi and welcome!

Since no one answered yet, I can give it a try.

First, you should try and format your code as code. See here.

To your question. I wouldn’t worry about efficiency at first.

More important: is there a simple delimiter you can test for? End of line, tab, anything? Then just do like you did already:

string_without(Delimiters, S),
{ read_term_from_atom(S, T, []) }

(Don’t forget you still need to consume the delimiter!)

If this isn’t good enough for your input, I’d first try making sure that the input is such that it makes it easy to test for a delimiter.

If this isn’t good enough either: what does your input really look like?

mike.elston · August 14, 2019, 6:45am

Hi Boris

Thanks for the suggestions. The consume-and-test method is my backstop if a deterministic approach is not possible. Or I could define my own limited grammar to read the terms, but that seems a shame since Prolog knows how to read Prolog terms very well indeed. The reason for the efficiency requirement is the amount of data to be processed - gigabytes of log files.

Mike

Boris · August 14, 2019, 7:50am

I think that if you test for a delimiter, it is deterministic. string_without//2 will take all codes up to the delimiter deterministically, then you optimistically give those codes to read_term_from_atom/3 (or why not term_string/2?) Here is the full example:

$ cat parse_log.pl 
:- module(parse_log, [delimited_term//2, nondet_term//1]).

:- use_module(library(dcg/basics)).

delimited_term(D, T) -->
    string_without([D], S),
    { read_term_from_atom(S, T, []) },
    [D].

nondet_term(T) -->
    string(S),
    /* this doesn't even work! */
    { read_term_from_atom(S, T, []) }.

You might want to hardcode your delimiter but anyway.

$ swipl
Welcome to SWI-Prolog (threaded, 64 bits, version 8.1.11-8-gbf5ff3dae)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- use_module(parse_log).
true.

?- phrase(delimited_term(0';, T), `foo(bar);`).
T = foo(bar).

I used a semicolon for the delimiter just to make it more obvious on the top-level.

The other one I tried to code by naively following your description, and it even seems to work at first sight:

?- phrase(nondet_term(T), `foo`).
T = foo ;
false.

… but of course you need to catch exceptions. Try parsing a compound with this.
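A minimal sketch of what catching the exceptions could look like. The name safe_nondet_term//1 is made up for illustration, and the sketch assumes that an incomplete prefix such as `foo(` raises error(syntax_error(_), _), which is then turned into plain failure so that string//1 can widen to the next prefix:

```prolog
:- use_module(library(dcg/basics)).

% safe_nondet_term(-T)// is a hypothetical name, not part of any library.
% string//1 widens S one code at a time. A prefix that is not a complete
% term raises a syntax error; catch/3 turns that into failure so that
% backtracking moves on to the next, longer prefix.
safe_nondet_term(T) -->
    string(S),
    { catch(read_term_from_atom(S, T, []),
            error(syntax_error(_), _),
            fail) }.
```

With this, ?- phrase(safe_nondet_term(T), `foo(bar)`). should get past the broken prefixes, but it still calls the reader once per prefix, so it stays quadratic in the length of the term.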

I guess another option you have is to work directly on the stream, without a DCG, and use read_term/3 when you expect a term.
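A rough sketch of the stream-based variant. It assumes something the original log format does not have: each term is terminated by a full stop followed by layout, because read_term/3 needs that to know where the term ends. read_word/2 and read_record/2 are hypothetical helper names:

```prolog
% read_word(+In, -Codes): hypothetical helper.
% Collects codes up to the next space (which is consumed) or end of file.
read_word(In, Codes) :-
    get_code(In, C),
    (   ( C == 0'\s ; C == -1 )
    ->  Codes = []
    ;   Codes = [C|Cs],
        read_word(In, Cs)
    ).

% read_record(+In, -Record): reads "Date ID Term." from the stream.
% ASSUMPTION: the term ends with ". " or ".\n", as read_term/3 requires.
read_record(In, log_line(Date, ID, Term)) :-
    read_word(In, DateCodes),
    DateCodes \== [],
    read_word(In, IDCodes),
    atom_codes(Date, DateCodes),
    number_codes(ID, IDCodes),
    read_term(In, Term, []).
```

For a quick check without a file, open_string/2 gives a stream over a string:

```prolog
?- open_string("2019-08-13 123456 my_term(foo).\n", In),
   read_record(In, R),
   close(In).
```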

Yet another option would be to use library(csv).

I personally would anyway try to use any common command line tool to clean up (“massage”) my input, to make it more Prolog friendly, and write simple obvious Prolog code to parse it or just read it.

EDIT: for example, why not use awk or something to make your input look like this:

log_line('2019-08-13', 123456, my_term(foo)).
log_line('2019-08-14', 234566, my_other_term(bar)).

Again, this depends on how well defined your actual data is. You might end up getting caught up in quoting problems, for example.

peter.ludemann · August 14, 2019, 8:24am

O’Keefe’s book has a tokeniser in Prolog, IIRC (I’m on vacation and can’t check).

Boris · August 14, 2019, 8:25am

But really, the more I think about it, log files are usually pretty obviously structured. Just use phrase_from_file/2 and use string_without//2 with the correct delimiter.
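A sketch of how that could look for the line format from the first post. It assumes every record is exactly "Date ID Term" separated by single spaces and ended by a newline; log_lines//1 and log_line//1 are names invented here:

```prolog
:- use_module(library(dcg/basics)).
:- use_module(library(pure_input)).   % provides phrase_from_file/2

% log_lines(-Lines)//: zero or more records up to the end of input.
log_lines([]) --> eos, !.
log_lines([L|Ls]) --> log_line(L), log_lines(Ls).

% log_line(-Record)//: one "Date ID Term\n" record.
% ASSUMPTION: the term itself contains no newline.
log_line(log_line(Date, ID, Term)) -->
    string_without(` `, DateCodes), ` `,
    integer(ID), ` `,
    string_without(`\n`, TermCodes), `\n`,
    { atom_codes(Date, DateCodes),
      read_term_from_atom(TermCodes, Term, [])
    }.
```

It would then be called as ?- phrase_from_file(log_lines(Ls), 'app.log'). where app.log is a placeholder file name. For gigabytes of input, collecting every record in one list is probably not what you want; processing each record inside log_lines//1 instead of returning it would keep memory use flat.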

Boris · August 14, 2019, 8:26am

Yes, it does have the full source. It sounds like a terrible overkill though

jan · August 14, 2019, 12:29pm

Only if the term has a terminating . can you theoretically find another way. You need the . in general because my_term(foo) x is a term if x is a postfix operator. If you know something to match behind the term, I’d use nondeterministic widening, using the matcher to avoid re-trying every character. Ideally, use the full grammar for the remainder of the record, so you get

start(StartTerms, ...), string(TermCodes), end(EndRecordTerms, ...),
catch(term_string(TermCodes, Term), ..., ...)
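To make the widening idea concrete, here is a small sketch under the assumption that the term is followed by a single known delimiter code D. The name widening_term//2 is hypothetical. The reader is only called at positions where the delimiter actually matches, and a syntax error just means "widen further":

```prolog
:- use_module(library(dcg/basics)).

% widening_term(+D, -T)//: hypothetical name.
% string//1 widens S lazily; [D] acts as the matcher behind the term,
% so the reader runs only at candidate end positions. If the candidate
% is not a complete term (e.g. D occurred inside a quoted atom), the
% syntax error becomes failure and S is widened to the next D.
widening_term(D, T) -->
    string(S),
    [D],
    { catch(read_term_from_atom(S, T, []),
            error(syntax_error(_), _),
            fail) }.
```

Unlike string_without//2, this copes with the delimiter appearing inside the term, for example ?- phrase(widening_term(0';, T), `foo('a;b');`).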

I typically write log files using Prolog quoted write such that the log file is a valid Prolog program that you can load to analyse it.
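A minimal sketch of the writing side, assuming you control the program that produces the log. write_log_entry/4 and the log_line/3 wrapper are invented for illustration; the point is only that ~q writes quoted and the trailing full stop makes each entry a readable clause:

```prolog
% write_log_entry(+Out, +Date, +ID, +Term): hypothetical helper.
% ~q writes the term quoted, so atoms that need quotes can be read back;
% ".~n" ends the clause so that read/1 or consult/1 accept it.
write_log_entry(Out, Date, ID, Term) :-
    format(Out, "~q.~n", [log_line(Date, ID, Term)]).
```

The resulting file can then be loaded with consult/1 and queried like any other set of facts, or read term by term with read_term/3.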

fnogatz · August 15, 2019, 3:48pm

Maybe library(plammar) might be useful for you: https://github.com/fnogatz/plammar (Disclaimer: I am the author of it.)

Note that it is not yet published in SWI’s package list, but you can manually install it (and the required dependency library(dcg4pt)) via git as described in the README.

Then, you can parse Prolog terms and get the corresponding abstract syntax tree (AST), parse tree (PT), or token list as follows:

?- prolog_ast(string("my_term(foo)."), AST).
AST = prolog([fact(compound(atom(my_term), [atom(foo)]))]).
?- prolog_parsetree(string("my_term(foo)."), PT).
PT = prolog([clause_term([term([atom(name([name_token(..., ...)])), open_ct(open_token(open_char(...))), arg_list(arg(...)), close(...)]), end([end_token(end_char('.'))])])]).
?- prolog_tokens(string("2019-08-13 123456 my_term(foo)"), Tokens).
Tokens = [integer([integer_token('2019', integer_constant([decimal_digit_char('2'), decimal_digit_char('0'), decimal_digit_char('1'), decimal_digit_char(...)]))]), name([name_token(-, graphic_token([graphic_token_char(graphic_char(-))]))]), integer([integer_token('08', integer_constant([decimal_digit_char('0'), decimal_digit_char(...)]))]), name([name_token(-, graphic_token([graphic_token_char(...)]))]), integer([integer_token('13', integer_constant([...|...]))]), integer([layout_text_sequence([...]), integer_token(..., ...)]), name([layout_text_sequence(...)|...]), open_ct(open_token(...)), name(...)|...].

The created nested term is based on the grammar rules as given in the ISO Prolog standard. Note that prolog_tokens accepts basically all strings, while prolog_parsetree and prolog_ast require the given string to be a valid Prolog program, i.e. in particular ending with a dot.
