Improving unit tests

I’m looking into improving the unit test system, notably

  • Add concurrency. There is already a start, but it is broken.
  • Improve the output.
  • Add timeouts.

The current feedback prints the unit and a single character for each test. If a test fails, it nicely displays the output and location (file:line) of the failed test, but if the system crashes (as is not uncommon for the internal tests :slight_smile: ) or hangs, you have to count dots and try to find the relevant test (if Prolog is still functioning well and the test runs pure Prolog, Ctrl-C will tell you which test is hanging).

Here I want to learn from ninja and ctest. I notably like the single output line and the absence of cluttered output from concurrent processes. ctest has the nice feature of only providing output for failed tests. I would like Ninja's feedback, but with a line per (concurrent) test, so you know at any time which tests are running. If the output is not a tty, we should get output similar to ctest.

Some progress is already in place, notably improved test data structures and events, and support for capturing test output, so we can decide whether to print it (on failure) or not.

Next to the terminal feedback we also want machine-readable feedback, which can either be sent to a file or replace the terminal output. Then the question is which format(s) to use. I checked Test Results Format Notes - eLinux.org. The Linux kernel project seems to be moving from TAP to KTAP, apparently due to slow development of the TAP protocol, as it says:

The effort to create TAP version 14 has stalled.

Reading the TAP 14 specification (Test Anything Protocol), this seems no longer to be true. Is TAP a good output format?
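
For reference, a minimal TAP (version 14) document looks roughly like this (the test names are invented for illustration):

TAP version 14
1..3
ok 1 - append/3 basics
not ok 2 - format_time/3 timezone name
ok 3 - message queue timing # SKIP needs multiple cores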


FYI for others

One should not confuse the SWI-Prolog Prolog Unit Tests (plunit), which are included with most installed versions of SWI-Prolog, with the SWI-Prolog pack TAP.

I am not sure if the Prolog Unit Tests are included with the WASM version, but I would not expect them to be installed by default.
(Jan W. corrected my knowledge gap in the next reply.)

The TAP package probably needs some updates, as it is based on the message events produced by PlUnit and some details have changed. Most likely it gets simpler :slight_smile: An obvious direction would be to update the TAP package and make it part of the default PlUnit package. @fnogatz, what do you think?

The WASM version includes library(plunit).

One thought that might be nice is to persist the test results into a Prolog file (think library(persistency)) so that the results can be loaded and queried directly. This would be in addition to a formal format as proposed.

The nice thing about having a persisted Prolog file is that it could include info that is not part of a standard format and which can also be queried with Prolog.
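
A minimal sketch of what such persistence could look like, using library(persistency); the test_result/4 fact, its fields and the helper predicates are just an illustration, not a proposed format:

:- use_module(library(persistency)).

:- persistent
       test_result(unit:atom,    % test unit
                   test:atom,    % individual test
                   status:atom,  % e.g. passed, failed
                   time:float).  % wall time in seconds

attach_results(File) :-
    db_attach(File, []).

record_result(Unit, Test, Status, Time) :-
    assert_test_result(Unit, Test, Status, Time).

% After a run, failures can be queried directly:
% ?- test_result(Unit, Test, failed, Time).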


If I understand the proposal correctly, then one of the side benefits of a standard output format would be the ability to use a third-party GUI that understands the format to control the running of the tests. Note: the image is only to give an idea of what such a GUI interface would look like, not a recommendation of that particular unit framework.

(Image: a GUI test runner, from Wikipedia CsUnit)


The only other unit-testing framework I have had extensive experience with in the past is xUnit.net, which derived from xUnit. I used it often when developing F# code, and it made running tests enjoyable; writing tests was still a pain. A side benefit of the GUI was also seeing code coverage and grouping tests into categories (think labels) such as new tests for a release, bug, enhancement, … .

@jan

Is the plan to build all of this using hooks, as is done with plunit, or is something new/different needed because a test may cause a crash?

Unsure if related: The tests from ctest seem to differ from the tests from test_installation.

Before I run this, I change something in swipl-devel/src/Tests/library/test_date.pl, for example,

ok(1152794050, '%Z', 'CEST') to ok(1152794050, '%Z', 'CESTxxx').

swipl-devel/build$ cmake -DINSTALL_TESTS=ON -DCMAKE_INSTALL_PREFIX=$HOME/swipl ..
swipl-devel/build$ make
swipl-devel/build$ ctest
(tests pass, no error message)
swipl-devel/build$ make install
swipl-devel/build$ /home/matthias/swipl/bin/swipl
?- test_installation.
(test for library fails, with error message: '%Z': got 'CEST', expected 'CESTxxx')

Although concurrency would be nice, I have quite a few tests that set global things (via setup or cleanup) that probably can’t run concurrently. :frowning:

Counting dots is a poor way of finding a crash. I often end up running tests by hand (copy&paste in emacs, so it’s not too difficult), to find the source of a crash.

If a test has multiple results, it’s a bit clunky:

test(foo, [A,B] == [one, two]) :-
    A=one, B=two.

It’d be nice to be able to specify it like this:

test(foo, [A==one, B==two]) :-
    A=one, B=two.

I often end up writing assertion(A==one) in my test body to make the tests more readable when the individual results are complex terms.

There are also other tests I’d like to specify, such as ground(A) – currently, I need to do assertion(ground(A)).
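
For illustration, that workaround looks like this (p/2 being some hypothetical predicate under test):

test(foo) :-
    p(A, B),
    assertion(A == one),
    assertion(ground(B)).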

This will simply use the PlUnit hooks. Current work is on extending these hooks to make it easier to create better output. Prolog crashes will stop processing the remainder of the tests. It should become easy to find which test caused the crash from the log output.

I think you have to install and run the installed version.

That would indeed be easier. It is completely orthogonal to the current changes though.

Not sure about that. It seems to open the door to an infinite number of other tests on the output. Such a test only applies if the term is complex or the details are not well defined. In other cases, use ==/2. We have assertion/1 for dealing with arbitrary conditions.

I thought I did that, see above:

swipl-devel/build$ cmake -DINSTALL_TESTS=ON -DCMAKE_INSTALL_PREFIX=$HOME/swipl ..
swipl-devel/build$ make
swipl-devel/build$ ctest
(tests pass, no error message)
swipl-devel/build$ make install
swipl-devel/build$ /home/matthias/swipl/bin/swipl <<< isn’t this the installed version?
?- test_installation.
(test for library fails, with error message: '%Z': got 'CEST', expected 'CESTxxx')

The CESTxxx indicates that the change was picked up, no? Isn’t the real issue that the test succeeds under ctest and fails under test_installation/0? That could well be if the environment differs (in this case locale, timezone, …?) The date tests are pretty sensitive to such issues.

Yes, indeed, that was my point. The test fails in test_installation, but not in ctest.

But I doubt that it passes in ctest because of locale issues (there is no timezone CESTxxx); my guess is rather that it is not even invoked under ctest (?).

It is invoked fine. If you look at test_date.pl though, you'll see it merely prints failures and doesn't make the test fail. As ctest normally only produces output if the test fails, nothing is ever printed. You can use ctest -V to get the output. As test_installation/0 doesn't hide any output, it behaves like ctest -V. Note that for both, the "failure" doesn't cause the test to fail.

The more general problem is that there are tests that may fail as part of the build/post-installation checks for which we do not want the build/installation to fail. This is one such case: different implementations of the time-formatting primitives give different output. As people may install in environments we do not know about, some of these tests may fail. Looking at it, we should probably mark which tests have well-defined output and which do not.

Also a few of the concurrency tests are based on timing and can fail under extreme conditions.

Ideally ctest would have some convention to emit messages even if the test succeeds, so we can print a “warning” and still succeed. I’m not aware of such an option. Something to keep in mind for the new PlUnit output :slight_smile:


Actually, library(tap) is already very simple :wink: It is not based on PlUnit and works out of the box with recent versions of SWI-Prolog. I am not sure it is useful to make it part of the PlUnit package, as it is currently not tightly coupled with it. However, I'm open to assist in any development with library(tap)!

Thanks. I think there was some TAP-compatible output library that used PlUnit events to generate the TAP output. Not this one … I would like to maintain PlUnit as the basis of the test framework. Surely it is not perfect, but it is fairly robust and well accepted.

It would be nice if failed tests were written in red, to tie in with the “Red, Green, Refactor” school of development. (Could also be that I'm out of date, since my example screenshot is quite vintage.)


Hi,

thank you for adding timeouts to the test library.

However, as I posted some time ago, I believe - strongly - that you need to expose a predicate that allows you to access the test results programmatically.

That way you can use the library in more interesting ways. The three predicates should be the following (the ‘2’ is just a rename of other predicates in the library to avoid conflicts):

report2(Passed, Failed, FailedAssertions, Blocked, STO) :- ...

run_tests_and_report(report(Passed,
                            Failed,
                            FailedAssertions,
                            Blocked,
                            STO)) :- ...

report_and_cleanup2(Ref,
                    report(Passed,
                           Failed,
                           FailedAssertions,
                           Blocked,
                           STO)) :- ...

I have code that works and I can share it.

All the best

Marco

What are you trying to achieve? The current work involves a lot of changes to the message terms that are responsible for the feedback. It passes more information, such as the progress in the test suite, the resources used by the test (time, space) and, optionally, the captured output of the test. It also makes all information of a finished test available in the final message, such that we can produce non-cluttered output when running tests concurrently.

The overall idea is that if you want to do something other than the default, you (re)define some of the message terms using message_hook/3. If you give me an idea about what you want to achieve, we can see whether that is already covered, whether some additions are needed, or whether more radical changes are required.
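
As a sketch of that approach (the plunit(...) event term and my_feedback/1 below are placeholders; the actual message terms are being extended as described above):

:- multifile user:message_hook/3.

% Succeeding here suppresses the default printing of the event.
user:message_hook(plunit(Event), _Kind, _Lines) :-
    my_feedback(Event).

my_feedback(Event) :-
    format(user_error, 'plunit event: ~p~n', [Event]).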

Hi Jan

the point of my proposal is that you should be able to bypass message_hook/3 and the other message machinery if you want. The results you get from the predicates I propose (which do work, BTW) can be used to write whatever report you need and, above all, to do computations with the test results.

A typical use is this:

testing_run :-
    writeln('>>> Starting testing run.'),
    run_tests_and_report(report(Passed,
                                Failed,
                                FailedAssertions,
                                Blocked,
                                STO)),
    writeln('>>> Finished testing run.'),
    testing_run_report(Passed, Failed, FailedAssertions, Blocked, STO).

The extension should be visible here: plunit_reporting link.

If not, I can send you a .zip (Discourse does not allow uploading of zip files).

All the best

Marco

Discourse has an option that admins of this site can change to allow zip files to be uploaded, but IMO there is no need to enable it, as very few users would download a zip file and it takes up our limited storage space. If the code is small and just a .pl file, it can be uploaded; if it is a page or so in size, it can be posted in a reply so that users can see the code immediately.

The file types that are currently enabled for uploading by non-staff users are:

  • jpg
  • jpeg
  • png
  • gif
  • svg
  • pl
  • c
  • log

FYI - I tried the plunit_reporting link, which takes me to a Google Drive page that responded with Access Denied.

Unfortunately, both the message terms and the internal dynamic data change, so extensions need to be updated. If you grant access to the Google doc we should get a clearer picture of what you want. Possibly we can define stable interfaces to support your demands.

As a side note, I'm wondering what to do with sto(true). This was promoted actively by Ulrich Neumerkel and runs all tests three times:

  1. normal (allowing unification to produce cyclic terms)
  2. with the occurs_check flag set to true, causing such unifications to fail
  3. with the occurs_check flag set to error, causing such unifications to raise an exception.

It seriously complicates the testing machinery, as we must run the test three times and then compare the results: if they are consistent, report normally; otherwise report an “sto” issue. All this can probably be resolved, but at the cost of a lot of complicated code to do something that is typically not very interesting. I am considering leaving around the machinery that allows checking in a particular mode, which is just a shorthand for a setup handler that changes the flag and a cleanup handler that restores it. We would allow for this flag at the unit and test level.
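
Written out as a plain test, that shorthand amounts to something like the following sketch (using the fail option, as X = f(X) fails when occurs_check is true; false is the default value being restored):

test(no_cyclic_term,
     [ setup(set_prolog_flag(occurs_check, true)),
       cleanup(set_prolog_flag(occurs_check, false)),
       fail
     ]) :-
    X = f(X).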

In my experience, I mostly switched off this feature for expensive tests, and I do not recall that it ever told me anything interesting. Mostly false results, claiming a failed test was sensitive to STO because it failed in different ways due to the use of random/1 or other features that cause non-repeatable results.

Of course, this assessment changes if there are enough people who actively use this feature. Anyone?