Mqi test fails on 8.4.0 build

On my Arch Linux PC, after running ctest -j 4 during the build steps for 8.4.0, I get the following:

99% tests passed, 1 tests failed out of 70

Total Test time (real) =  56.73 sec

The following tests FAILED:
	 46 - mqi:mqi (Failed)
Errors while running CTest

My ~/src/swipl-8.4.0/build/Testing/Temporary/LastTest.log looks like this

Start testing: Oct 06 09:48 SAST
----------------------------------------------------------

46/70 Testing: mqi:mqi
46/70 Test: mqi:mqi
Command: "/home/roblaing/src/swipl-8.4.0/build/src/swipl" "-p" "foreign=:/home/roblaing/src/swipl-8.4.0/build/packages/plunit" "-f" "none" "--no-packs" "--on-error=status" "
-s" "/home/roblaing/src/swipl-8.4.0/packages/mqi/test_mqi.pl" "-g" "test_mqi" "-t" "halt"
Directory: /home/roblaing/src/swipl-8.4.0/build/packages/mqi
"mqi:mqi" start time: Oct 06 09:48 SAST
Output:
----------------------------------------------------------
% PL-Unit: py_mqi_fast 
% swipl in dir '/home/roblaing/src/swipl-8.4.0/build/src'; Packed args: '-p~|~foreign=:/home/roblaing/src/swipl-8.4.0/build/packages/plunit~|~-f~|~none~|~--no-packs~|~--on-e
rror=status~|~-s~|~/home/roblaing/src/swipl-8.4.0/packages/mqi/test_mqi.pl~|~-g~|~test_mqi~|~-t~|~halt'
test_async_query (test_prologserver.TestPrologMQI) ... **** Note that some builds of Prolog will print out messages about
'Execution Aborted' or 'did not clear exception...' when running tests.  Ignore them.
skipped ok
test_async_query_slow (test_prologserver.TestPrologMQI) ... skipped ok
test_connection_close_with_running_query (test_prologserver.TestPrologMQI) ... skipped ok
test_connection_failure (test_prologserver.TestPrologMQI) ... ok
test_debugging_options (test_prologserver.TestPrologMQI) ... skipped ok
test_goal_thread_failure (test_prologserver.TestPrologMQI) ... skipped ok
test_json_to_prolog (test_prologserver.TestPrologMQI) ... ok
test_multiple_connections (test_prologserver.TestPrologMQI) ... skipped ok
test_multiple_serial_connections (test_prologserver.TestPrologMQI) ... skipped ok
test_protocol_edge_cases (test_prologserver.TestPrologMQI) ... skipped ok
test_python_classes (test_prologserver.TestPrologMQI) ... ok
test_quit (test_prologserver.TestPrologMQI) ... ok
test_server_options_and_shutdown (test_prologserver.TestPrologMQI) ... FAIL
test_server_options_and_shutdown_slow (test_prologserver.TestPrologMQI) ... skipped ok
test_sync_query (test_prologserver.TestPrologMQI) ... skipped ok
test_sync_query_slow (test_prologserver.TestPrologMQI) ... skipped ok
test_unix_domain_socket_embedded (test_prologserver.TestPrologMQI) ... ok
test_unknown_command (test_prologserver.TestPrologMQI) ... ok
test_write_output_to_file_in_embedded_mode (test_prologserver.TestPrologMQI) ... ok

======================================================================
FAIL: test_server_options_and_shutdown (test_prologserver.TestPrologMQI)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/roblaing/src/swipl-8.4.0/packages/mqi/python/test_prologserver.py", line 845, in test_server_options_and_shutdown
    self.assertEqual(afterShutdownThreads, initialThreads)
AssertionError: Lists differ: ['mai[59 chars]running', 'mqi1_conn1_comm:running', 'mqi2_conn2_comm:running'] != ['mai[59 chars]running', 'mqi1_conn1_comm:running']

First list contains 1 additional elements.
First extra element 5:
'mqi2_conn2_comm:running'

  ['main:running',
   'gc:running',
   'mqi1:running',
   'mqi1_conn1_goal:running',
-  'mqi1_conn1_comm:running',
?                           ^

+  'mqi1_conn1_comm:running']
?                           ^

-  'mqi2_conn2_comm:running']

----------------------------------------------------------------------
Ran 19 tests in 5.189s

FAILED (failures=1)
ERROR: /home/roblaing/src/swipl-8.4.0/packages/mqi/test_mqi.pl:83:
        test mqi: wrong answer (compared using ==)
ERROR:     Expected: exit(0)
ERROR:     Got:      exit(1)
 done
% 1 test failed
% 0 tests passed
ERROR: -g test_mqi: false
<end of output>
Test time =   5.66 sec
----------------------------------------------------------
Test Failed.
"mqi:mqi" end time: Oct 06 09:49 SAST
"mqi:mqi" time elapsed: 00:00:05
----------------------------------------------------------

End testing: Oct 06 09:49 SAST

I have been seeing the same in the swipl-devel repo. It happened in the 8.3.x and now it also happens in 8.5.0.

Also on Archlinux btw. Since the whole OS is on the “bleeding edge” I have started ignoring problems that don’t directly break something I am working on.

On various systems this is due to a missing “pgrep” executable. The package holding that should be in the build dependencies. On some Linux systems you get pgrep from one of the core packages, on others you have to install it explicitly. @ericzinda, we could also consider dropping the tests if pgrep is not installed?

I just checked, and pgrep is installed by default on Arch Linux, so not sure that would fix it.

Indeed. Seems some Python thread is not terminating if I read the test correctly (test_prologserver.py, line 845). @ericzinda ?

Thanks for reporting! I’ll take a look at this today. Disabling the test for lack of pgrep was on the list, so I’ll do that, and I’ll also try to figure out what is up with the particular failure above.

@joeblog and @Boris: I’m hoping this is just a race condition in the test where threads are just taking longer to exit on your systems and my test was poorly written to handle it.

Here is an updated test that gives them time if they haven’t exited yet (just remove the .log at the end of the filename): test_prologserver.py.log (56.4 KB)

Could you try it on your systems and see if the test passes? Just replace test_prologserver.py on your system with that file and rerun the test. I’ve also fixed the dependency on pgrep but I think we established this isn’t a problem here.

Great @ericzinda, it seems to help. I found two files with that name

./packages/mqi/python/test_prologserver.py
./packages/language_server/python/test_prologserver.py

and I replaced the first one with the one that you shared here. Now the mqi tests all pass for me. Of course I have no idea what it all does.

I think I saw this on my system (ubuntu 20.0.4, 4 cores, running ctest -j4) but it disappeared when I re-ran ctest. Of course, when I tried to reproduce the problem, it didn’t reproduce (after I reinstalled the OS, which seems to have improved performance for a few things …).

100% tests passed, 0 tests failed out of 70

I replaced swipl-8.4.0/packages/mqi/python/test_prologserver.py

I do have a fairly vintage, slow PC, so that may have been the problem.

You should remove the directory ./packages/language_server. That was the old name and the renaming of the git submodule doesn’t seem to wipe the directory. It is harmless but confusing.

Does this mean the timing is now more relaxed or that the issue simply cannot happen anymore? The core system contains a couple of tests that cannot be guaranteed under extreme timings. In these cases I merely print a warning if the test fails. Making the test fail will cause installation failures if the installation includes testing. It is really hard to rely on timing as some of these builds happen on heavily loaded, slow and/or with only one core. Maybe we should add an environment variable that allows such tests to succeed by default, but fail for developers that have this flag set?

The test is checking to make sure the threads it starts actually stop. So, the update to the test I sent waits until a timeout for the thread to exit, but it could still fail if the system is really overloaded.

Yeah, it seems like your approach is good. Let me know what environment variable you want to use for the system and I’ll update the test to: wait 3 seconds for the thread to exit (the behavior in the update I posted) and then fail or print a message depending on the environment variable.

Let us invent something. What about SWIPL_TEST_FAIL_ON_UNLIKELY. It is a bit long, but clear and only needed for developers. So, if a test cannot be guaranteed to succeed you do this.

(    the real test
->   true
;    \+ getenv('SWIPL_TEST_FAIL_ON_UNLIKELY', y)
).

In most cases things being slow does not need to be a problem. I typically use a polling loop to check for completion (say) about 10 times per second, giving up after a really long time. For example if 1 second is normal, wait upto a minute before signalling an test failure. I’m using this mostly for atom and clause GC that are conservative collectors and thus sometimes fail to collect something that is unreachable.

That’s good to know! I was trying to be more conservative with that timeout but I’ll turn it to a minute too, and increase the polling rate a bit (I think I’m doing 1 a second at the moment).

The Environment variable sounds fine to me. I’ll update the test for that and the timing and update the MQI repository.