Counting actual Prolog lines loaded

Hello,

I noticed (great) code that counts the lines of a file, so I could use this to load the files in a folder and count their lines.

However, I am curious how many lines of code a loaded program has. Can this be counted?

There is listing/0, but from the documentation I am unsure whether it descends into modules, and it also lists things that are predefined rather than part of the consulted files.

Edit:

Can one identify a valid Prolog term in a line, without enumerating all possible terms it could be?

Using the file-read approach, I am now wondering if I can identify a valid Prolog term in a given line. If such a term can be identified, then it is a valid Prolog line and not something else, i.e. a comment or a blank line.

thanks,

Dan

Edit:

That’s the code I currently have:

%------------------------- 
% all, and only,  *.pl files are in a folder in path
% path atom must end with '\\'
%-------------------------

:- use_module(library(pure_input)).   % for stream_to_lazy_list/2

count_lines(Path, Count) :-
	dir_files(Path, Entries),
	aux_count_file(Path, Entries, Count).

% no entries left: nothing more to count
aux_count_file(_Path, [], 0).

aux_count_file(Path, [First | Rest], New_Count) :-
	First \= '.',
	First \= '..',
	file_nl_count(Path, First, Count),
	aux_count_file(Path, Rest, Count_Rest),
	New_Count is Count + Count_Rest.

% skip '.' and '..' (and anything that could not be counted)
aux_count_file(Path, [_First | Rest], New_Count) :-
	aux_count_file(Path, Rest, New_Count).


dir_files(Path, Entries) :-
	directory_files(Path, Entries).

file_nl_count(Path, File0, Count) :-
	atom_concat(Path, File0, File),
	format('~w~n', [File]),
	setup_call_cleanup(
	    open(File, read, In),
	    stream_nl_count(In, Count),
	    close(In)).

stream_nl_count(In, Count) :-
	stream_to_lazy_list(In, List),
	aggregate_all(count, member(0'\n, List), Count).
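
For example, assuming the *.pl files live in a (made-up) folder c:\myprolog, the intended call would be:

?- count_lines('c:\\myprolog\\', Total).

(The doubled backslashes inside the quoted atom stand for single backslashes.)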

Is there an aggregate that could be used to make a word histogram? A word histogram would not count lines, but count the occurrences of words in a text file, and list the count for every word of the file.

My suspicion is: there is! If you use aggregate/3 and not aggregate_all/3 as you do in your example. But I have also read people suggesting dicts for that purpose. But dicts only have O(n) operations right now.

How would you program it, with or without dicts?

If you want to find all the clauses in a file, you could probably use read_term/2 with the subterm_positions option (or term_expansion/4, which uses read_term/2 with that option).

(I hope I didn’t misunderstand the conversation)
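
For instance, something along these lines (a rough, untested sketch; file_terms/2 and read_terms/2 are just names I made up) would give you every term of a file together with its layout:

file_terms(File, Terms) :-
    setup_call_cleanup(
        open(File, read, In),
        read_terms(In, Terms),
        close(In)).

read_terms(In, Terms) :-
    read_term(In, Term, [subterm_positions(Pos)]),
    (   Term == end_of_file
    ->  Terms = []
    ;   Terms = [Term-Pos|Rest],
        read_terms(In, Rest)
    ).

Comments and blank lines never show up as terms, so the length of Terms already gives a rough measure of the code that matters, clause-wise rather than line-wise.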

The aggregate predicate family is useful but at the moment not generic enough. The solution to single-pass aggregation (very large files, streams) that seems obvious to me is as follows:

  1. Create a predicate that backtracks over the solutions;
  2. Use a non-backtrackable data structure for aggregation.

For the second one, I have used nb_rbtrees with this tiny bit of code:

nbdict_apply(X, Key, Pred, Init) :-
    (   nb_rb_get_node(X, Key, Node)
    ->  nb_rb_node_value(Node, Val0),
        call(Pred, Val0, Val1),
        nb_rb_set_node_value(Node, Val1)
    ;   nb_rb_insert(X, Key, Init)
    ).

This will either insert a default Initial value or apply the Predicate to the existing value associated with the Key. This post by Jan W discusses the computational complexity. It also suggests an obvious and better way to count word frequency that does not work if your input is big enough. I guess your question refers in part to that post?

If you do it like this, you can simply use forall/2. If we assume you have defined a file_word/2 predicate that succeeds once for every word in a file, you would do:

rb_empty(Freqs),
forall(file_word(File, Word),
    nbdict_apply(Freqs, Word, succ, 1)),
forall(rb_in(Key, Val, Freqs),
    format("~w ~w~n", [Key, Val]))

But this is still not optimal: you need to hand-roll both the backtracking predicate and the non-backtrackable accumulator for anything non-trivial. Do you have a better idea?

I was rather expecting some usage of aggregate/3:

test(File) :-
     aggregate(count, file_word(File, Word), Count),
     format("~w ~w~n", [Word, Count]),
     fail; true.

Works on my side. Can simulate the file with a few facts:

file_word(foo, bar).
file_word(foo, baz).
file_word(foo, bar).

This gives me:

?- test(foo).
bar 2
baz 1
true.

How come this works? aggregate/3 performs bagof/3 on the goal.
But maybe an aggregate/3 implementation with nb_rbtrees would be better.
You have to try; most aggregates wouldn’t be problematic.

I think @boris doesn’t like that this program uses memory that corresponds to the length of the input text rather than to the set of words. If you are not concerned with that, the most efficient way is to convert the input into a list of words, sort the list without removing duplicates, and count adjacent equivalent words.

If you want to process in time proportional to the number of words it gets more complicated. A dynamic fact with the word and a counter is rather costly. Hash tables or trees, possibly using destructive assignment, come to mind. My most recent approach for a (different) counting problem was to maintain facts counter(+Obj, -Index), where each new object found gets the next index, and a global array (a compound) where we use nb_setarg/3 to update the count. If the array gets too small we double it in size. You can find the code in the new test coverage code (package plunit).
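
Just to illustrate the idea (my own names and simplifications, not the actual plunit code): counting the occurrences of the elements of a list with a counter/2 fact table plus a growable array that is updated with nb_setarg/3 could look like this:

:- dynamic counter/2.                    % counter(Object, Index)

count_occurrences(Objects, Pairs) :-
    retractall(counter(_, _)),
    mk_array(16, Array0),
    State = state(Array0, 0),            % state(CountArray, LastUsedIndex)
    forall(member(Obj, Objects), add_one(State, Obj)),
    arg(1, State, Array),
    findall(Obj-Count, (counter(Obj, I), arg(I, Array, Count)), Pairs).

add_one(State, Obj) :-
    (   counter(Obj, I)                  % known object: reuse its index
    ->  true
    ;   arg(2, State, I0),               % new object: allocate the next index
        I is I0 + 1,
        nb_setarg(2, State, I),
        ensure_capacity(State, I),
        assertz(counter(Obj, I))
    ),
    arg(1, State, Array),
    arg(I, Array, C0),
    C is C0 + 1,
    nb_setarg(I, Array, C).              % destructive, non-backtrackable update

ensure_capacity(State, I) :-             % double the array when it is full
    arg(1, State, Array),
    functor(Array, counts, Size),
    (   I =< Size
    ->  true
    ;   Size2 is Size * 2,
        mk_array(Size2, New),
        forall(between(1, Size, J),
               ( arg(J, Array, C), nb_setarg(J, New, C) )),
        nb_setarg(1, State, New)
    ).

mk_array(Size, Array) :-                 % a compound with Size slots, all 0
    functor(Array, counts, Size),
    forall(between(1, Size, J), nb_setarg(J, Array, 0)).

so that, for example:

?- count_occurrences([a,a,b,a,c,c], Pairs).
Pairs = [a-3, b-1, c-2].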

How do you do global arrays, is this something new?

Is it not possible to have this:

Globally, like a global variable?

You could then use the nb_rbtree to update values.

Edit 19.07.2021:
Seems not to work since clauses do not point into terms.
Everything gets compiled into instructions. So what are global arrays?

?- [user].
test(foo(bar)).

?- vm_list(test/1).
========================================================================
test/1
========================================================================
       0 s_virgin
       1 i_exit
----------------------------------------
clause 1 (<clause>(0000000006B8CA40)):
----------------------------------------
       0 h_functor(foo/1)
       2 h_atom(bar)
       4 h_pop
       5 i_exitfact
true.

It’s not that I don’t like it :slight_smile: I just thought it is a well-understood solution. You even provided the clumped/2 predicate recently to make this even easier, together with the “counting word frequency” example. I thought the question was “what if you can’t do it the normal way”…

I know that memory is cheap etc., but sometimes there is just too much data, and we are moving more and more to an “Infrastructure as a service” mode of operation where the costs slowly start to add up.

clumped/2 is not the same as a histogram. clumped/2 gives:

?- clumped([a,a,b,a,a,a,a,c,c,c], R).
R = [a-2, b-1, a-4, c-3].

histogram needs to give:

?- histogram([a,a,b,a,a,a,a,c,c,c], R).
R = [a-6, b-1, c-3].

You can implement histogram as follows:

histogram(L, R) :-
    findall(X-C, aggregate(count, member(X, L), C), R).

See also (it seems Wikipedia wants me to call it a bar chart, since the x-axis will be words and not numbers):

https://en.wikipedia.org/wiki/Histogram

Edit 20.07.2021:
But is there a relationship with dicts? Just for fun I did the following, a dictall/4, which is the findall/3 equivalent, but instead of creating a list it creates a dict:

dictall(T, P, G, D) :-
     findall(P, G, L),
     dict_pairs(D, T, L).

We could implement histogram as follows:

histogram(L, R) :-
    dictall(histogram, X-C, aggregate(count, member(X, L), C), R).

The result is then:

?- histogram([a,a,b,a,a,a,a,c,c,c], D).
D = histogram{a:6, b:1, c:3}.

If you look at the example under clumped/2 you will see the thing I was referring to:

msort(Words, Sorted), clumped(Sorted, Counts)

But this is of course unnecessary if you already have your data in the Prolog database. Your aggregate is perfectly good, and before that was available, I have memories of doing something like this:

bagof(true, file_word(foo, X), Xs), length(Xs, Freq)

but of course using aggregate(count, ...) is better in every way. (EDIT: I can’t now remember which came first, library(aggregate) or library(solution_sequences), but I seem to remember that both were not available when I started using SWI-Prolog; could be a false memory though.)

Just to make it clear, I am looking at this from a very specific angle, namely, how to write Prolog programs that have somewhat predictable memory use regardless of the size of the input data. I now know of at least one more user, @swi , who has run into related issues.

That’s a non-functional requirement. It means it’s an issue of “how” you implement something and not “what” you implement. Try this if you are interested in memory:

The msort/2 solution with clumped/2, and the current aggregate/3, which is based on bagof/3 and in turn on keysort/2, all use way too much space. Basically they use n cons cells (the list cells [_|_] in msort or keysort) to arrive at a count of n, whereas the nb_rbtrees solution would only use a single integer for a count of n. For example, a word that occurs a million times costs a million list cells with the sorting approach, but only one integer counter in the tree.

Well, you already tried. You showed use of the nb_rbtrees API, but can it be moved into aggregate/3? You wouldn’t need to do it if all your counts are 1. But if counts are on average 100 or so, then you save roughly a factor of 100 in cons cells (the list cells [_|_] in msort or keysort).

You would trade the many list cells for the fewer tree “cells”, i.e. the nodes that make up the red-black tree. But it need not be a red-black tree; a hash table could also work.
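
Just to sketch what moving this into an aggregate-like predicate could look like (count_by_key/3 is a made-up name, reusing nbdict_apply/4 from above; library(rbtrees) provides rb_empty/1 and rb_in/3, library(nb_rbtrees) the nb_rb_* predicates):

:- use_module(library(rbtrees)).
:- use_module(library(nb_rbtrees)).

% count the solutions of Goal grouped by Key, one integer counter per key
count_by_key(Key, Goal, Pairs) :-
    rb_empty(Tree),
    forall(Goal, nbdict_apply(Tree, Key, succ, 1)),
    findall(K-N, rb_in(K, N, Tree), Pairs).

With the file_word/2 facts from above, count_by_key(W, file_word(foo, W), Pairs) should give Pairs = [bar-2, baz-1], while only ever holding one counter per distinct word.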