Kernel panics in garbage collection?

I’ve been looking at a somewhat challenging CLP application that is resource intensive (both time and memory). When I run it with garbage collection disabled, it quickly runs out of memory with the standard “Stack limit exceeded” exception. When garbage collection is enabled, the vast majority of the time a macOS kernel panic is generated (sometimes after hours) with this cause:

  ...
mp_kdp_enter() timed-out on cpu 10, NMI-ing
mp_kdp_enter() NMI pending on cpus: 0 1 2 3 4 5 6 7 8 9 11 12 13 14 15
mp_kdp_enter() timed-out during locked wait after NMI;expected 16 acks but received 1 after 10945024 loops in 1500000000 ticks
panic(cpu 10 caller 0xffffff8000adb5da): "Machine Check at 0x00000001013a9c23, 
  ...

I interpret this (perhaps incorrectly) as meaning that some thread holds a lock and won’t give it up. The process running is usually swipl, which isn’t surprising since not much else is running at the time. Occasionally the process is the kernel extension responsible for power management (com.apple.driver.AppleIntelCPUPowerManagement(220.0)).
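
For reference, garbage collection is enabled or disabled for these runs via the standard gc Prolog flag. A minimal sketch (run_app/0 just stands in for the actual CLP goal, which is in the notebook mentioned below):

  % run_app/0 is a stand-in for the real CLP goal, not actual application code
  ?- set_prolog_flag(gc, false), run_app.   % GC disabled: fails fast with the stack limit error
  ?- set_prolog_flag(gc, true),  run_app.   % GC enabled (the default): the configuration that panics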

On a couple of occasions, there have been more graceful aborts:

[PROLOG SYSTEM ERROR:  Thread 1 (main) at Sun Jul  7 18:43:40 2024
	relocation chains = 1
[While in 78345-th garbage collection]

C-stack trace labeled "SYSERROR":
  [0] save_backtrace() at  [0x104fbfdcc]


PROLOG STACK:
  [202241] clpBNR:getValue/2 [PC=25 in clause 1]
  [202240] clpBNR:doNode_/7 [PC=35 in clause 1]
  [202239] clpBNR:stableLoop_/2 [PC=42 in clause 2]
  [199806] clpBNR:stable_/1 [PC=11 in clause 2]
  [199793] clpBNR:eval_MS/4 [PC=17 in clause 1]
  [199792] clpBNR:iterate_MS/6 [PC=122 in clause 1]
  [12] system:<meta-call>/1 [PC=13 in clause -1]
  [11] $toplevel:toplevel_call/1 [PC=3 in clause 1]
  [10] $toplevel:stop_backtrace/2 [PC=4 in clause 1]
  [9] $tabling:$wfs_call/2 [PC=17 in clause 1]
]

[pid=904] Action? 

and

Failed to print resource exception due to lack of space
error(resource_error(stack),stack_overflow{choicepoints:8,depth:2731,environments:18,globalused:204039,localused:3,stack:[frame(2731,clpBNR:narrowing_op(mul,_52234302,($)/3,($)/3),[]),frame(2730,clpBNR:evalNode(mul,_52234338,($)/3,($)/3),[]),frame(2729,clpBNR:doNode_(($)/3,mul,_52234376,424,_52234380,(/)/2,_52234384),[]),frame(2728,clpBNR:stableLoop_((/)/2,424),[]),frame(151,clpBNR:stable_((/)/2),[])],stack_limit:204800,trailused:636})

Either of these is preferable to a kernel panic.
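
For what it’s worth, the resource error can at least be intercepted so the application fails cleanly rather than dying. A rough sketch (solve/1 is a stand-in for the actual top-level goal):

  % hypothetical wrapper; solve/1 stands in for the real goal
  safe_solve(X) :-
      catch(solve(X),
            error(resource_error(stack), Context),
            (   print_message(error, error(resource_error(stack), Context)),
                fail
            )).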

I’ve tried changing a few of the Prolog stack parameters (min_free and spare), but the observed behaviour didn’t change. It occurs with both SWIP 9.2.5 and 9.3.5. I’m running an older version of macOS (Mojave); it may be an OS bug, or even a hardware issue, but it’s certainly reproducible with SWIP.
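
For the record, the adjustments were along these lines (values illustrative only; min_free and spare are per-stack properties set with set_prolog_stack/2):

  % illustrative values only, applied to the global stack
  ?- set_prolog_stack(global, min_free(100_000)),
     set_prolog_stack(global, spare(50_000)).

  % current settings can be inspected with prolog_stack_property/2
  ?- prolog_stack_property(global, min_free(MF)),
     prolog_stack_property(global, spare(SP)).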

I’m not really expecting any solutions (more graceful error handling) from this post, but any insights/suggestions are welcome. If there’s any interest, I can compose a SWISH notebook with the application test code, although I’d be reluctant to run it on a production server.

Hard to say. The system error exception is 99% sure to be a bug in SWI-Prolog. For the rest, I’d first check that the system has enough memory. The “Failed to print resource exception due to lack of space” message is printed when the system tries to print the out-of-stack error but cannot allocate the space needed to do so.

A kernel panic is by definition an OS bug. Shortage of memory is not an unlikely reason.

I’m curious how much memory the Mac has, and whether the bug might be reproduced on a Linux machine with similar memory. (I might have a machine I don’t mind crashing, but I need to apply some updates before I can rebuild swipl on it.)

It could be interesting to try changing some of the system values via the ulimit command (or whatever the Mac equivalent is).

Pretty sure it’s not a shortage of OS memory. The machine has 32 GB, of which 22 GB is free.

I’m currently running the swipl stack at 250 MB to encourage lots of GCs for earlier failure. It may be that that’s insufficient for error recovery, but neither increasing it (e.g., to 1 GB) nor modifying the Prolog global stack parameters (see OP) is effective at altering the behaviour. If there’s some other useful config parameter, I’m happy to try that.
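
For completeness, the 250 MB limit is set either on the command line (swipl --stack-limit=250m) or from within Prolog via the stack_limit flag, which takes bytes:

  % illustrative only; the 1 GB variant tried above would be 1_000_000_000
  ?- set_prolog_flag(stack_limit, 250_000_000).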

Re machine memory, see above. According to the man pages there is a ulimit command, but that’s the extent of my knowledge on the subject.

I’ll create a SWISH notebook with the necessary instructions and code and post the link on this topic, should you choose to spend any time on this.

A link to the notebook:

https://swish.swi-prolog.org/p/_GC_kernelpanic_CLP.swinb

BTW, here’s what my Mac defaults are:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2837
virtual memory          (kbytes, -v) unlimited

I ran the program from the notebook on my Linux machine. Gives this after a bit more than 10 hours:

2 ?- hxcgr(OF,_) ,global_minimize(OF,C,2).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 27Kb, global: 0.8Gb, trail: 44Kb
ERROR:   Stack depth: 4,647,196, last-call: 100%, Choice points: 4
ERROR:   Possible non-terminating recursion:
ERROR:     [4,647,196] clpBNR:tree_insert(<compound n/3>, <compound (-)/2>, _227096092)
ERROR:     [4,647,195] clpBNR:tree_insert(<compound n/3>, <compound (-)/2>, _227096126)
3 ?- statistics.
% Started at Thu Jul 11 21:18:35 2024
% 37331.538 seconds cpu time for 256,959,630,536 inferences
% 9,927 atoms, 6,811 functors, 5,534 predicates, 120 modules, 289,296 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -      108 Kb    2,224  b
% Global stack:           -      128 Kb   17,192  b
% Trail  stack:           -      130 Kb      544  b
%        Total:    1,024 Mb      366 Kb   19,960  b
% 
% 12,348 garbage collections gained 2,636,942,236,712 bytes in 2695.190 seconds.
% 14 atom garbage collections gained 21,266 atoms in 0.010 seconds.
% 17 clause garbage collections gained 718 clauses in 0.000 seconds.
% Stack shifts: 3,322 local, 1,397 global, 1,687 trail in 972.685 seconds
% 3 threads, 0 finished threads used 0.000 seconds
true.

During the first 15 minutes, while I watched memory usage, it stayed low (stacks < 100 MB). The error report suggests that eventually it gets into a loop, running out of stack.

No panics. I have little clue on that. Kernel panics due to (mis)behaving applications are very rare these days AFAIK. Could it be malfunctioning memory? After all, this is a very memory intensive application.

Thanks @jan, that’s exactly the behaviour I would have expected - a gradual increase in memory usage until the stack limit was reached. (The program is essentially a forward recursive search loop for a global minimum.) My concern was ungraceful termination due to the kernel panic - I’m now inclined to believe that’s a bug in one of the macOS kernel extensions, perhaps power management. (If you recall, the panic was related to a timeout on a lock.)

I have ECC memory and no errors were reported, so I doubt it was a memory/hardware problem. In any case, it doesn’t look like a SWIP issue at this point.