Counting actual Prolog lines loaded

Hello,

I noticed (great) code that counts the lines of a file, so I could use this to load the files in a folder and count their lines.

However, I am curious how many lines of code a loaded program has. Can this be counted?

There is listing, but I am unsure, from the documentation, whether it goes into modules; it also lists things that are not part of the consulted files but are predefined.

Edit:

Can one identify a valid Prolog term in a line, without enumerating all the possible terms it could be?

Using the file-reading approach, I am now wondering if I can identify a valid Prolog term in a given line; if such a term can be identified, then it is a valid Prolog line and not something else, i.e. a comment or a blank line.
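
For example, I was thinking of something along these lines (just a rough sketch using term_string/2; I realise a clause can span more than one line, so a per-line check can only ever be approximate):

% rough check: a line "counts" as code if, after stripping whitespace,
% it is non-empty, is not a % comment, and parses as a Prolog term;
% term_string/2 throws on syntax errors, so catch/3 turns that into failure
line_has_term(Line) :-
    split_string(Line, "", " \t", [Stripped]),
    Stripped \== "",
    \+ string_concat("%", _, Stripped),
    (   string_concat(Body, ".", Stripped)   % drop a trailing full stop
    ->  true
    ;   Body = Stripped
    ),
    catch(term_string(_Term, Body), _, fail).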

thanks,

Dan

Edit:

That is the code I currently have:

%-------------------------
% all, and only, *.pl files are in the folder at Path
% the Path atom must end with '\\'
%-------------------------

:- use_module(library(pure_input)).   % stream_to_lazy_list/2

count_lines(Path, Count) :-
    dir_files(Path, Entries),
    aux_count_file(Path, Entries, Count).

aux_count_file(_Path, [], 0).

aux_count_file(Path, [First | Rest], New_Count) :-
    First \= '.',
    First \= '..',
    !,
    file_nl_count(Path, First, Count),
    aux_count_file(Path, Rest, Count_Rest),
    New_Count is Count + Count_Rest.

aux_count_file(Path, [_First | Rest], New_Count) :-
    aux_count_file(Path, Rest, New_Count).

dir_files(Path, Entries) :-
    directory_files(Path, Entries).

file_nl_count(Path, File0, Count) :-
    atom_concat(Path, File0, File),
    format('~w~n', [File]),
    setup_call_cleanup(
        open(File, read, In),
        stream_nl_count(In, Count),
        close(In)).

stream_nl_count(In, Count) :-
    stream_to_lazy_list(In, List),
    aggregate_all(count, member(0'\n, List), Count).

If you want to find all the clauses in a file, you could probably use read_term/2 with the subterm_positions option (or term_expansion/4, which uses read_term/2 with that option).
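
For example, a minimal sketch that just counts the terms (clauses and directives) in a single file; file_term_count/2 is a made-up name, and it assumes the file can be read with the default operator table:

% read_term/3 skips comments and whitespace, so blank lines and
% comments are not counted
file_term_count(File, Count) :-
    setup_call_cleanup(
        open(File, read, In),
        stream_term_count(In, 0, Count),
        close(In)).

stream_term_count(In, Count0, Count) :-
    read_term(In, Term, []),
    (   Term == end_of_file
    ->  Count = Count0
    ;   Count1 is Count0 + 1,
        stream_term_count(In, Count1, Count)
    ).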

(I hope I didn’t misunderstand the conversation)

The aggregate predicate family is useful but at the moment not generic enough. The obvious (to me) solution for single-pass aggregation (very large files, streams) is as follows:

  1. Create a predicate that backtracks over the solutions;
  2. Use a non-backtrackable data structure for aggregation.

For the second one, I have used library(nb_rbtrees) with this tiny bit of code:

:- use_module(library(nb_rbtrees)).

% update the value under Key with Pred, or insert Init if Key is new
nbdict_apply(X, Key, Pred, Init) :-
    (   nb_rb_get_node(X, Key, Node)
    ->  nb_rb_node_value(Node, Val0),
        call(Pred, Val0, Val1),
        nb_rb_set_node_value(Node, Val1)
    ;   nb_rb_insert(X, Key, Init)
    ).

This will either insert the default Init value or apply Pred to the existing value associated with Key. This post by Jan W discusses the computational complexity. It also suggests an obvious, better way to count word frequencies which, however, does not work if your input is big enough. I guess your question refers in part to that post?

If you do it like this, you can simply use forall/2. If we assume you have defined a file_word/2 predicate that succeeds once for every word in a file, you would do:

rb_empty(Freqs),
forall(file_word(File, Word),
    nbdict_apply(Freqs, Word, succ, 1)),
forall(rb_in(Key, Val, Freqs),
    format("~w ~w~n", [Key, Val]))

But this is still not optimal: you need to hand-roll both the backtracking predicate and the non-backtrackable accumulator for anything non-trivial. Do you have a better idea?

I think @boris doesn’t like that this program uses memory proportional to the length of the input text rather than to the set of words. If you are not concerned with that, the most efficient way is to convert the input into a list of words, sort the list without removing duplicates, and count adjacent equal words.

If you want to process in time proportional to the number of words it gets more complicated. A dynamic fact with the word and a counter is rather costly. Hash tables or trees, possibly using destructive assignment, come to mind. My most recent approach for a (different) counting problem was to maintain facts counter(+Obj, -Index), where each new object found gets the next index, together with a global array (a compound term) whose counts we update using nb_setarg/3. If the array gets too small, we double it in size. You can find the code in the new test coverage code (package plunit).
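
Roughly, the idea is something like this (just a sketch of the technique, not the actual plunit code; all names are made up):

:- dynamic counter/2.

% counter/2 maps each new object to the next free index; a global
% compound acts as a growable array of counts updated with nb_setarg/3

init_counts :-
    retractall(counter(_,_)),
    flag(counter_next, _, 0),
    new_counts(64, Array),
    nb_setval(counts, Array).

new_counts(Size, Array) :-
    length(Zeros, Size),
    maplist(=(0), Zeros),
    Array =.. [counts|Zeros].

count(Obj) :-
    (   counter(Obj, Index)
    ->  true
    ;   flag(counter_next, I0, I0+1),
        Index is I0+1,
        assertz(counter(Obj, Index)),
        grow_if_needed(Index)
    ),
    nb_getval(counts, Array),
    arg(Index, Array, Old),
    New is Old+1,
    nb_setarg(Index, Array, New).

grow_if_needed(Index) :-
    nb_getval(counts, Array),
    functor(Array, _, Size),
    (   Index =< Size
    ->  true
    ;   Size2 is Size*2,
        new_counts(Size2, New),
        forall(between(1, Size, I),
               ( arg(I, Array, V), nb_setarg(I, New, V) )),
        nb_setval(counts, New)
    ).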

I don’t not like it :slight_smile: I just thought it was a well-understood solution. You even provided the clumped/2 predicate recently to make this even easier, together with the “counting word frequency” example. I thought the question was “what if you can’t do it the normal way”…

I know that memory is cheap, etc., but sometimes there is just too much data; we are moving more and more to an “Infrastructure as a Service” mode of operation, and the costs are slowly starting to add up.

If you look at the example under clumped/2, you will see the thing I was referring to:

msort(Words, Sorted), clumped(Sorted, Counts)
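
For example:

?- msort([the,cat,sat,on,the,mat,the], Sorted), clumped(Sorted, Counts).
Sorted = [cat, mat, on, sat, the, the, the],
Counts = [cat-1, mat-1, on-1, sat-1, the-3].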

But this is of course unnecessary if you already have your data in the Prolog database. Your aggregate is perfectly good; before that was available, I have memories of doing something like this:

bagof(true, file_word(foo, X), Xs), length(Xs, Freq)

but of course using aggregate(count, ...) is better in every way. (EDIT: I can’t now remember which came first, library(aggregate) or library(solution_sequences), but I seem to remember that both were not available when I started using SWI-Prolog; could be a false memory though.)
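
For the same hypothetical file_word/2, that would be something like:

aggregate(count, file_word(foo, Word), Freq)

(the free Word groups solutions per word, just as with bagof/3).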

Just to make it clear, I am looking at this from a very specific angle, namely: how to write Prolog programs that have somewhat predictable memory use regardless of the size of the input data. I now know of at least one more user, @swi, who has run into related issues.