Kernel panics in garbage collection?

I’ve been looking at a somewhat challenging CLP application that is resource intensive (both time and memory). When I run it with garbage collection disabled, it quickly runs out of memory with the standard “Stack limit exceeded” exception. When garbage collection is enabled, the vast majority of the time a macOS kernel panic is generated (sometimes after hours) with this cause:

  ...
mp_kdp_enter() timed-out on cpu 10, NMI-ing
mp_kdp_enter() NMI pending on cpus: 0 1 2 3 4 5 6 7 8 9 11 12 13 14 15
mp_kdp_enter() timed-out during locked wait after NMI;expected 16 acks but received 1 after 10945024 loops in 1500000000 ticks
panic(cpu 10 caller 0xffffff8000adb5da): "Machine Check at 0x00000001013a9c23, 
  ...

I interpret this (perhaps incorrectly) as meaning that some thread holds a lock and won’t give it up. The process running is usually swipl, which isn’t surprising since not much else is running at the time. Occasionally the process is the kernel extension responsible for power management (com.apple.driver.AppleIntelCPUPowerManagement(220.0)).
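
For reference, garbage collection is enabled or disabled for these runs via the standard gc Prolog flag. A minimal sketch (run_app/0 just stands in for the actual CLP goal, which is in the notebook mentioned below):

  % run_app/0 is a stand-in for the real CLP goal, not actual application code
  ?- set_prolog_flag(gc, false), run_app.   % GC disabled: fails fast with the stack limit error
  ?- set_prolog_flag(gc, true),  run_app.   % GC enabled (the default): the configuration that panics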

On a couple of occasions, there have been more graceful aborts:

[PROLOG SYSTEM ERROR:  Thread 1 (main) at Sun Jul  7 18:43:40 2024
	relocation chains = 1
[While in 78345-th garbage collection]

C-stack trace labeled "SYSERROR":
  [0] save_backtrace() at  [0x104fbfdcc]


PROLOG STACK:
  [202241] clpBNR:getValue/2 [PC=25 in clause 1]
  [202240] clpBNR:doNode_/7 [PC=35 in clause 1]
  [202239] clpBNR:stableLoop_/2 [PC=42 in clause 2]
  [199806] clpBNR:stable_/1 [PC=11 in clause 2]
  [199793] clpBNR:eval_MS/4 [PC=17 in clause 1]
  [199792] clpBNR:iterate_MS/6 [PC=122 in clause 1]
  [12] system:<meta-call>/1 [PC=13 in clause -1]
  [11] $toplevel:toplevel_call/1 [PC=3 in clause 1]
  [10] $toplevel:stop_backtrace/2 [PC=4 in clause 1]
  [9] $tabling:$wfs_call/2 [PC=17 in clause 1]
]

[pid=904] Action? 

and

Failed to print resource exception due to lack of space
error(resource_error(stack),stack_overflow{choicepoints:8,depth:2731,environments:18,globalused:204039,localused:3,stack:[frame(2731,clpBNR:narrowing_op(mul,_52234302,($)/3,($)/3),[]),frame(2730,clpBNR:evalNode(mul,_52234338,($)/3,($)/3),[]),frame(2729,clpBNR:doNode_(($)/3,mul,_52234376,424,_52234380,(/)/2,_52234384),[]),frame(2728,clpBNR:stableLoop_((/)/2,424),[]),frame(151,clpBNR:stable_((/)/2),[])],stack_limit:204800,trailused:636})

Either of these is preferable to a kernel panic.
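
For what it’s worth, the resource error can at least be intercepted so the application fails cleanly rather than dying. A rough sketch (solve/1 is a stand-in for the actual top-level goal):

  % hypothetical wrapper; solve/1 stands in for the real goal
  safe_solve(X) :-
      catch(solve(X),
            error(resource_error(stack), Context),
            (   print_message(error, error(resource_error(stack), Context)),
                fail
            )).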

I’ve tried changing a few of the Prolog stack parameters (min_free and spare), but the observed behaviour didn’t change. It occurs with both SWIP 9.2.5 and 9.3.5. I’m running an older version of macOS (Mojave); it may be an OS bug, or even a hardware issue, but it’s certainly reproducible with SWIP.
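
For the record, the adjustments were along these lines (values illustrative only; min_free and spare are per-stack properties set with set_prolog_stack/2):

  % illustrative values only, applied to the global stack
  ?- set_prolog_stack(global, min_free(100_000)),
     set_prolog_stack(global, spare(50_000)).

  % current settings can be inspected with prolog_stack_property/2
  ?- prolog_stack_property(global, min_free(MF)),
     prolog_stack_property(global, spare(SP)).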

I’m not really expecting any solutions (more graceful error handling) from this post, but any insights/suggestions are welcome. If there’s any interest, I can compose a SWISH notebook with the application test code, although I’d be reluctant to run it on a production server.

Hard to say. The system error exception is 99% sure to be a bug in SWI-Prolog. For the rest, I’d first check that the system has enough memory. The “Failed to print resource exception due to lack of space” message is printed when the system tries to print the out-of-stack error but cannot allocate the space needed to do so.

A kernel panic is by definition an OS bug. Shortage of memory is not an unlikely reason.

I’m curious how much memory the Mac has, and whether the bug might be reproduced on a Linux machine with similar memory. (I might have a machine I don’t mind crashing, but I need to apply some updates before I can rebuild swipl on it.)

It could be interesting to try changing some of the system values via the ulimit command (or whatever the Mac equivalent is).

Pretty sure it’s not a shortage of OS memory. The machine has 32 GB, of which 22 GB is free.

I’m currently running the swipl stack at 250 MB to encourage lots of GCs for earlier failure. It may be that that’s insufficient for error recovery, but neither increasing it (e.g., to 1 GB) nor modifying the Prolog global stack parameters (see OP) is effective at altering the behaviour. If there’s some other useful config parameter, I’m happy to try that.
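
For completeness, the 250 MB limit is set either on the command line (swipl --stack-limit=250m) or from within Prolog via the stack_limit flag, which takes bytes:

  % illustrative only; the 1 GB variant tried above would be 1_000_000_000
  ?- set_prolog_flag(stack_limit, 250_000_000).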

Re machine memory, see above. According to the man pages there is a ulimit command, but that’s the extent of my knowledge on the subject.

I’ll create a SWISH notebook with the necessary instructions and code and post the link on this topic, should you choose to spend any time on this.

A link to the notebook:

https://swish.swi-prolog.org/p/_GC_kernelpanic_CLP.swinb

BTW, here’s what my Mac defaults are:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2837
virtual memory          (kbytes, -v) unlimited

I ran the program from the notebook on my Linux machine. Gives this after a bit more than 10 hours:

2 ?- hxcgr(OF,_) ,global_minimize(OF,C,2).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 27Kb, global: 0.8Gb, trail: 44Kb
ERROR:   Stack depth: 4,647,196, last-call: 100%, Choice points: 4
ERROR:   Possible non-terminating recursion:
ERROR:     [4,647,196] clpBNR:tree_insert(<compound n/3>, <compound (-)/2>, _227096092)
ERROR:     [4,647,195] clpBNR:tree_insert(<compound n/3>, <compound (-)/2>, _227096126)
3 ?- statistics.
% Started at Thu Jul 11 21:18:35 2024
% 37331.538 seconds cpu time for 256,959,630,536 inferences
% 9,927 atoms, 6,811 functors, 5,534 predicates, 120 modules, 289,296 VM-codes
% 
%                     Limit   Allocated      In use
% Local  stack:           -      108 Kb    2,224  b
% Global stack:           -      128 Kb   17,192  b
% Trail  stack:           -      130 Kb      544  b
%        Total:    1,024 Mb      366 Kb   19,960  b
% 
% 12,348 garbage collections gained 2,636,942,236,712 bytes in 2695.190 seconds.
% 14 atom garbage collections gained 21,266 atoms in 0.010 seconds.
% 17 clause garbage collections gained 718 clauses in 0.000 seconds.
% Stack shifts: 3,322 local, 1,397 global, 1,687 trail in 972.685 seconds
% 3 threads, 0 finished threads used 0.000 seconds
true.

During the first 15 minutes, while I watched memory usage, it stayed low (stacks < 100 MB). The error report suggests that eventually it gets into a loop, running out of stack.

No panics. I have little clue on that. Kernel panics due to (mis)behaving applications are very rare these days AFAIK. Could it be malfunctioning memory? After all, this is a very memory intensive application.

Thanks @jan, that’s exactly the behaviour I would have expected - a gradual increase in memory usage until the stack limit was reached. (The program is essentially a forward recursive search loop for a global minimum.) My concern was ungraceful termination due to the kernel panic - I’m now inclined to believe that’s a bug in one of the macOS kernel extensions, perhaps power management. (If you recall, the panic was related to a timeout on a lock.)

I have ECC memory and no errors were reported, so I doubt it was a memory/hardware problem. In any case, it doesn’t look like a SWIP issue at this point.