I hacked plunit/test_cover.pl to give me clause-level coverage information. I was able to determine that I had over 96% coverage and was able to show the exact clauses that weren’t executed (show_coverage/1 merely said 87.8% coverage, with no indication of what wasn’t covered).
But I want to go deeper and ensure that every goal is executed, and it’s not obvious how to do that. Any thoughts?
One possibility: use goal_expansion/4 to wrap every goal call and record those calls, similar to how assert_entered/1 works.
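To make the idea concrete, here is a minimal sketch of that approach (not the plunit code; cover_file/1 is a made-up fact naming the files to instrument, and explicitly module-qualified goals are only handled approximately):

:- module(goal_cover, [goal_entered/2]).

:- dynamic goal_entered/2.                  % goal_entered(File:Line, Goal)
:- dynamic cover_file/1.                    % assumed: files we want instrumented

% Remember that the goal at Where was reached at least once, then call it.
record_goal(Where, Goal) :-
    (   goal_entered(Where, Goal)
    ->  true
    ;   assertz(goal_entered(Where, Goal))
    ),
    call(Goal).

% Control constructs, cut and already qualified goals are left alone so
% that expansion still reaches the goals inside them.
leave_alone(G) :- var(G), !.
leave_alone((_,_)).   leave_alone((_;_)).   leave_alone((_->_)).
leave_alone((_*->_)). leave_alone(\+ _).    leave_alone(_:_).
leave_alone(!).       leave_alone(true).

% Wrap every remaining goal in the instrumented files.  Line is the line
% of the clause, not of the goal; the position argument could refine this.
user:goal_expansion(Goal, Pos, goal_cover:record_goal(File:Line, M:Goal), Pos) :-
    \+ leave_alone(Goal),
    Goal \= record_goal(_, _),              % never wrap the wrapper itself
    source_location(File, Line),
    cover_file(File),
    prolog_load_context(module, M).

Load this before the program under test, assert cover_file/1 facts (in this module) for the files of interest, run the tests, and then compare goal_entered/2 against the goals found by reading the sources.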
(If anyone wants to improve the test coverage code by incorporating my code, I’ll happily send it to you – right now, it’s very specific to my needs, but it should be fairly easily generalized. I might eventually get around to doing this myself, for the Greater Good of All …)
I’m surely interested in the enhancement you already have. Could you share that?
Just thinking about a complete coverage report. What if you record the location in the parent goal (clause + PC)? That should be enough to figure it out.
Would recording the location of the parent goal work with things like maplist? (I think you’d want to record the grandparent for those, or even the great-grandparent …)
I’ve pushed some elaborate changes to the coverage support. I can now run e.g.,
?- show_coverage(qtest_scasp, [dir(cov)]).
which creates a directory cov with *.cov files that hold annotated versions of the original files. It annotates every clause in the file with one of:
### if the clause was never entered
++N if the clause was entered N times and always succeeded
--N if the clause was entered N times and never succeeded
+N-M if the clause succeeded N times and failed M times
The annotations use ANSI color sequences. Having goals inside clauses counted as well would be great. The main drawback is that the performance is quite poor (about a 30 times slowdown compared to normal execution). Possibly we should move part of the data collection to C …
I guess that depends a little. In most cases the resulting clause(s) are marked as being associated with the line holding the term that was expanded. Of course, this goes wrong if term_expansion/2 uses assertz/1 or similar to modify the program, and there are surely some other ways to make this fail.
The coverage tool only uses the file and line number associated with the clause.
An option to disable coloring is surely a good idea. Just send a PR. Overall I think it is a good start; the main issues are the lack of line-level annotation and the performance impact (20~30 times).
There’s a small bug – the File passed to line_annotation/4 is the truncated file name from summary/3. I did a quick work-around by changing line 332 to summary(File, 256, SFile).
Also, please put the annotation information ("###" etc) into the documentation.
(BTW, I didn’t see any “ANSI color sequences” in my .cov files)
The tests will be in SWI-Prolog package plunit https://github.com/SWI-Prolog/packages-plunit
List items such as arguments should include a space after the comma.
But I have choices for the following and need guidance:
Should the test file be a .plt or a .pl file?
Should the test be in a separate directory, i.e. Tests, or in the same directory?
a. If in a separate directory, what is the path for that directory?
I plan to create temporary files using tmp_file/2 with test setup and cleanup. Is that a current best practice?
I’d like my tests to also serve as example code, with more examples added than needed for testing by using the test option forall/1 (see the sketch after these questions). Is that something I should do, or should I just forgo adding the extra examples?
Some of the tests will need multiple source files in multiple modules; I am thinking that I will need to create several directories holding the valid results for comparison. Any thoughts or suggestions?
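For concreteness, a minimal sketch of what I have in mind (names are made up). tmp_file/2 only generates a unique name, so the test creates and deletes the file itself, which also keeps concurrently running tests from interfering:

:- use_module(library(plunit)).

:- begin_tests(tmp_files).

% Per-test setup/cleanup; cleanup runs even if the body fails or throws.
test(write_then_read,
     [ setup(tmp_file(cover_demo, File)),
       cleanup(catch(delete_file(File), _, true))
     ]) :-
    setup_call_cleanup(open(File, write, Out),
                       format(Out, "hello.~n", []),
                       close(Out)),
    setup_call_cleanup(open(File, read, In),
                       read(In, Term),
                       close(In)),
    assertion(Term == hello).

% forall/1 runs the body for every binding, doubling as example cases.
test(succ_examples, forall(member(X-Y, [1-2, 2-3, 9-10]))) :-
    succ(X, Y).

:- end_tests(tmp_files).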
Tests that are part of the system all follow some very simple rules (a minimal skeleton follows the list):
They are named test_*.pl.
They define a module whose name equals the base name of the file.
They export a single predicate with the same name and zero arity.
The test is supposed to succeed iff the predicate succeeds.
The tests must clean up such that they can be run multiple times.
Tests must be aware that other tests may run concurrently. This is mostly an issue if you create files.
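A sketch of a test file that follows these rules (the file/module name test_cover_demo is made up):

:- module(test_cover_demo, [test_cover_demo/0]).
:- use_module(library(plunit)).

% Entry point for the harness: succeeds iff all tests in this file pass.
test_cover_demo :-
    run_tests([cover_demo]).

:- begin_tests(cover_demo).

test(atom_length) :-
    atom_length(abc, 3).

:- end_tests(cover_demo).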
For packages there is a CMake function test_libs().
A good example doing most of the nasty stuff is in the zlib package.
That said, how useful is it to test this? Its working depends mostly on the tracer hooks, which have their tests in the core test suite. Besides playing with the tracer hooks, most of the code is boring and doesn’t touch much that is likely to break. Even if it breaks, no application (should) depend on this.
As far as I’m concerned, tests should mostly address things that are tricky, depend on not very stable interfaces, depend on non-portable platform features, etc. A good deal of the tests originate from fixed bugs. As there used to be a bug, there is apparently something non-trivial, and the last thing we want is for a reported bug to reappear a couple of versions later. Tests that are very unlikely to ever trigger a bug are not that useful. They mostly make the system bigger and the test suite run more slowly.
The main job for the tests is to make sure that they work in all configurations. That is surely not the case right now. They work for the complete system built on Unix/Linux/… like systems. Several tests fail on Windows, I think mostly because the tests need to take care of limitations and variations that are OK. Many tests fail if threads are disabled, mostly because they should be conditional on threads, but some possibly for other reasons. Next, it probably makes sense to run a C and Prolog coverage analysis and spot areas that are sensitive to bugs and have poor coverage.
I have a slightly different view of unit tests – they should aim for 100% coverage, especially with a “dynamically typed” language such as Prolog. (And if you work at Google, you’ll be indoctrinated with this view – and I’ve seen it pay off; for example, when Python changed its hashing function, many thousands of unit tests broke because they were testing code that had an inadvertent dependency on hash ordering.)
On top of that, I expect corner case tests. So, for example, with “dict”, I would want tests for:
_{} :< foo throws a type exception
D{} :< d{x:1} succeeds with D==d
d{x:X} :< d{x:1} succeeds with X==1
dict_pairs(D, tag, [a-1,b-2,a-1]) throws a “duplicate key” exception
D=_{}, is_dict(D, tag) succeeds with D==tag{}.
etc., etc.
In particular, the exception should be as specific as possible, so that people know what they can depend on – e.g. the exact formal error term rather than just “some exception” (see the sketch below).
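A sketch of how a few of these could be written with plunit, pinning the exact formal terms; type_error(dict, foo) and duplicate_key(a) are my reading of what current SWI-Prolog throws, and that exactness is precisely what such tests would fix in place:

:- use_module(library(plunit)).

:- begin_tests(dict_corner_cases).

% Pin the exact formal term, not just "throws something".
test(partial_dict_non_dict, error(type_error(dict, foo))) :-
    _{} :< foo.

% Values are unified through :</2.
test(partial_dict_value, true(X == 1)) :-
    d{x:X} :< d{x:1}.

% Duplicate keys are rejected with a specific error.
test(dict_pairs_duplicate_key, error(duplicate_key(a))) :-
    dict_pairs(_, tag, [a-1, b-2, a-1]).

:- end_tests(dict_corner_cases).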
Even if something isn’t specified to this level of detail, people will depend on it anyway, leaving an opportunity for future breakage: https://www.hyrumslaw.com/
I applied this philosophy when I recently updated library(protobufs); 100% test coverage found 3 bugs that had been there for years, and edge cases found some more, which required rewriting the low-level serialization functions (the rewrite also took advantage of some newer functionality in SWI-Prolog).
There are two reasons for tests: to make sure changes do not introduce bugs, and to at least be notified if behavior changes. I don’t think all behavior changes should be considered bugs. For example, I think it is fine if member/2 would at some point complain if the second argument is not a list. Changes may also affect semantically irrelevant ordering or the precise exception, etc. If we fix all that, I fear there is little room for future enhancements. The ISO standard is an example of a too-tight standard in some places. For example, call((fail,1)) must raise an exception in ISO, i.e. it enforces full analysis of the argument to call/1 prior to execution. Some systems (ECLiPSe) deal with the argument to call/1 in an opportunistic way and thus simply fail.
Also, exceptions fall into two categories. Some are there to deal with an unexpected state of the environment, such as a non-existent file; here it must be specified when there is an exception, as well as what the shape of the formal part of error(Formal, ImplDependent) is. Others (e.g., type errors) result from broken programs; here the exact exception, as well as when it is raised (for example, the compiler may already refuse the program), is pretty irrelevant. The most important property is that the programmer is informed as early as possible and with enough detail to find the cause of the problem.
IMO, as a programmer you should always try to use your dependencies “the way they are intended”. Only if there is no other choice and the corner-case behavior you want to rely on is documented can you ignore this rule.
As for tests, I have no counts, but I think the vast majority of the bugs originate from the C code. Only some complicated parts of the Prolog code suffer regularly from issues.
But it’s still OK to have a unit test checking that member(x, foo) fails – it could also have a comment saying that this behavior might change in the future (and when the behavior is changed, the test is changed as well).
(Incidentally, d{a:1}:<foo throws an error – and it’s fine to have a test for this also and maybe a comment about how it’s inconsistent with member/2)
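A sketch of such a test, with the caveat carried as a comment (the test-set name is made up):

:- use_module(library(lists)).
:- use_module(library(plunit)).

:- begin_tests(current_behaviour).

% Documents *current* behaviour only: member/2 silently fails when the
% second argument is not a list.  If member/2 ever starts raising a type
% error, update this test rather than treat the change as a regression.
test(member_non_list, fail) :-
    member(x, foo).

:- end_tests(current_behaviour).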
That’s the problem – it’s often not obvious how things were intended. As the examples of member/2 and (:<)/2 show, there are multiple possible design choices when the type is not what’s expected (cue @ridgeworks)
Partly because the C layer does complicated things (memory allocation, gc, threading, etc.) and then presents a much simpler view to the Prolog level. But if you’re doing moderately complex things in Prolog (e.g., writing a compiler or implementing a standard such as protobufs), you’ll probably end up with plenty of test cases in Prolog as well. (Protobufs has ~2000 lines of implementation and ~2700 lines of tests, although the tests are less “dense” than the implementation.)
That’s fine for development, but if code is deployed in production, specific handling of exceptions might be needed (non-existent files, permission errors, resource limitations, and so on).
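A sketch of what that might look like (read_config/2, config_unreadable/1 and the defaults fallback are made-up application-level names); the point is to key on the documented formal part of the error term:

% Environment problems get a recovery path; anything else is assumed to
% be a programming error and is re-thrown untouched.
read_config(File, Config) :-
    catch(setup_call_cleanup(open(File, read, In),
                             read(In, Config),
                             close(In)),
          Error,
          config_error(Error, File, Config)).

config_error(error(existence_error(source_sink, _), _), _File, defaults) :- !.
config_error(error(permission_error(open, source_sink, _), _), File, _) :- !,
    throw(config_unreadable(File)).
config_error(Error, _, _) :-
    throw(Error).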
Anyway, my experience with the Google code base is that even when some module was well documented, I often ended up with questions about details that weren’t in the documentation but that were covered in the unit tests. (I’ve also had to deal with other large code bases that had far fewer tests, and had to read the code to figure out how they handled certain things, or had to write my own tests to figure out the behavior.)