We have a segfault

Hello everyone.

I’m on the latest development version of SWI-Prolog on multiple platforms (Windows 11, WSL running Ubuntu 22.04, and Fedora 38).

SWI-Prolog crashes on all three platforms when I reload a file with make/0 after uncommenting some lines of code in it, but only if I have first run a query.

On Linux I get the following output:

[debug]  ?- make.
Loading experiment file module data(poker_examples/palindrome_weak_setting.pl) from palindrome.

ERROR: Received fatal signal 11 (segv)
Time: Wed Feb 19 22:53:54 2025
Inferences: 10521727
Thread: 1 (main)
C-stack trace labeled "crash":
  [0] save_backtrace() at /home/mememe/swipl-devel/src/os/pl-cstack.c:337 [0x7f233f3bc818]
  [1] sigCrashHandler() at /home/mememe/swipl-devel/src/os/pl-cstack.c:937 [0x7f233f35a1c4]
  [2] __restore_rt() at ??:? [0x7f233f0fbbb0]
  [3] _ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEji() at ??:? [0x7f233f44f280]
  [4] _ZN8tcmalloc11ThreadCache8ScavengeEv() at ??:? [0x7f233f45059f]
  [5] freeHeap() at /home/mememe/swipl-devel/src/pl-alloc.c:215 [0x7f233f3b1db4]
  [6] cleanDefinition() at /home/mememe/swipl-devel/src/pl-proc.c:1882 [0x7f233f3a3554]
  [7] endReconsult___LD() at /home/mememe/swipl-devel/src/pl-srcfile.c:1637 [0x7f233f31a194]
  [8] pl_fixup_reconsult1_va() at /home/mememe/swipl-devel/src/pl-srcfile.c:1779 [0x7f233f3a96a4]
  [9] PL_next_solution___LD() at /home/mememe/swipl-devel/src/pl-vmi.c:4344 [0x7f233f3650af]
  [10] callProlog() at /home/mememe/swipl-devel/src/pl-pro.c:530 [0x7f233f37bf6d]
  [11] pl_sig_atomic1_va() at /home/mememe/swipl-devel/src/pl-pro.c:422 [0x7f233f37c005]
  [12] PL_next_solution___LD() at /home/mememe/swipl-devel/src/pl-vmi.c:4344 [0x7f233f3650af]
  [13] query_loop() at /home/mememe/swipl-devel/src/pl-pro.c:171 [0x7f233f3a1493]
  [14] prologToplevel() at /home/mememe/swipl-devel/src/pl-pro.c:656 [0x7f233f3a132d]
  [15] PL_toplevel() at /home/mememe/swipl-devel/src/pl-fli.c:4999 [0x7f233f3b55aa]
  [16] swipl(+0x10b0) [0x4010b0]
  [17] __libc_start_call_main() at ??:? [0x7f233f0e5b8a]
  [18] __libc_start_main_alias_2() at :? [0x7f233f0e5c4b]
  [19] swipl(+0x10e5) [0x4010e5]


PROLOG STACK (without arguments):
  [38] system:$fixup_reconsult/1 <foreign>
  [37] system:$load_file/4 [PC=92 in clause 1]
  [36] setup_call_cleanup/3 [PC=5 in clause 1]
  [34] system:$consult_file/5 [PC=22 in clause 2]
  [33] system:$do_load_file_2/5 [PC=221 in clause 1]
  [30] system:$qdo_load_file/4 [PC=10 in clause 1]
  [28] setup_call_cleanup/3 [PC=5 in clause 1]
  [27] system:$c_call_prolog/0 [PC=0 in top query clause]
  [26] system:sig_atomic/1 <foreign>
  [23] system:$load_file/3 [PC=12 in clause 1]


PROLOG STACK (with arguments; may crash if data is corrupted):
Segmentation fault (core dumped)

Note that this happens with other files too, not just the one named at the start of the output above, right after make/0 is called.

I made a new branch on my repo with everything set up so as to reproduce the error with minimal fuss. I also opened an issue with instructions to help reproduce it. See here for the issue:

This is a link to the project:

https://github.com/stassa/vanilla/

The branch for the reproduction is poker_segv.

Thanks.

I think this should be fixed with fdfb4f233a60285d47679e91708a42f3f7b7c452. Nice catch, as I suspected something was wrong with wrapping and unwrapping predicates. This happens with all the dynamic table/1 and untable/1 calls you make. I didn’t look into what happens exactly, but this is surely not how tabling was intended to be used. If the purpose is to get rid of the tables, use one of the abolish*table predicates. There is a whole bunch of them, ranging from abolishing the table for a particular variant to abolishing all tables.
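
For concreteness, here are a few members of that family, going from specific to general (the predicate and module names in these calls are just placeholders, not anything from the report above):

% Drop the tables for one tabled predicate, or a specific variant of it:
?- abolish_table_subgoals(metaint(_,_)).

% Drop all tables owned by one module:
?- abolish_module_tables(experiment_file).

% Drop every table; tables are simply rebuilt on the next call:
?- abolish_all_tables.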


Thanks Jan! I know I’m abusing table/1 and untable/1 with dynamic predicates; that hadn’t caused trouble until now, so I kept using them.

It all has to do with the roundabout way I’m loading training data into my ILP systems. I described what I do here:

It’s a bit complicated but, in short, I have an inductive meta-interpreter for ILP that is trained on data asserted to the dynamic database. I have to table the meta-interpreter because if I don’t, it happily constructs, and tries to prove, left-recursive programs, and then it can go infinite. It doesn’t help to table the programs it constructs, because it’s the meta-interpreter itself that goes infinite.
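
To make the left-recursion point concrete, here is the textbook example (not my actual meta-interpreter): under plain SLD resolution the first clause of path/2 loops forever, while under tabling the query terminates.

:- table path/2.

path(X, Y) :- path(X, Z), edge(Z, Y).   % left-recursive: loops without tabling
path(X, Y) :- edge(X, Y).

edge(a, b).
edge(b, c).

% ?- path(a, X).
% X = b ;
% X = c.   % (answer order may differ under tabling)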

The problem is that if I change the training data, with assert/retract, and re-run the meta-interpreter, I sometimes get results consistent with the original training data, from before the database update. I’m not sure exactly how or why that happens (I think the updated training data are tabled as dynamic and incremental automatically - which is neat, btw) but it seems clear that the meta-interpreter is retrieving the results stored in the tables and ignoring the updated training data.

So I need to clean up the tables between training runs. I can use abolish_all_tables/0 for that, like the docs suggest, but if I untable the meta-interpreter, I have to table it again the next time it runs or it will go infinite. So I use table/1. Is that what’s causing trouble? And how else can I restart the tabling process?

To be honest I’d rather not have to do the tabling/untabling all the time because it can be expensive, and slows things down if I set a large table space limit, but I can’t think of what else to do. I don’t want to force the user to start a new session for every new training run.

Well, table/1 and untable/1 should of course work, and it is nice that the memory management issue with them is resolved. Untabling and re-tabling a predicate is precisely the same as abolishing its tables using abolish_table_subgoals/1, though.

That should be fine if all dynamic predicates as well as all tabled predicates are flagged as incremental. Of course, do not use abolish/1, as that loses the properties (abolish/1 is a useless predicate).
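
A minimal sketch of what those declarations look like, with made-up predicate names (positive_example/1 standing for the asserted training data, metaint/2 for the tabled meta-interpreter):

:- dynamic positive_example/1 as incremental.
:- table metaint/2 as incremental.

% After assert/1 or retract/1 on positive_example/1, the affected metaint/2
% tables are invalidated and recomputed on the next call, with no manual
% untabling or re-tabling.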

I don’t understand. Just deleting its tables using abolish_all_tables/0 should be fine. The predicates remain tabled, just the tables are gone. They will be recreated when needed. Semantically, abolishing tables is a no-op (provided the program is not changed). It just forces re-building the tables on the next call.

Whether using incremental tabling or just abolishing all tables is better depends on the scale of the changes. If more or less everything is changed, abolishing is probably faster. If updates only require a minority of the tables to be recomputed, incremental tabling will win. It is not free, though. It needs to track dependencies, propagate changes and plan recomputation.

An alternative might be to run the ILP checkers in a thread. By default the tables are thread-local, so they vanish with the thread. It also allows evaluating concurrently.
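
A minimal sketch of that, assuming a whole training run can be wrapped in one goal (run_training/1 is a made-up name): first_solution/3 from library(thread) runs the goal in its own thread and copies the answer back, and the thread’s private tables are gone when it terminates.

:- use_module(library(thread)).

% run_training/1 is hypothetical; it stands for one complete training run.
% Its private (thread-local) tables vanish when the worker thread exits.
training_run_in_thread(Result) :-
    first_solution(Result, [run_training(Result)], []).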


OK. I hadn’t thought of that. I think I convinced myself that re-tabling was necessary and then I didn’t think about it anymore. Thanks, I’ll try rewriting everything with that in mind.

That’s an option. I added some multi-threading recently and I’m still finding my way around it. One of the PhD students I was working with re-wrote my algorithm with multi-threading natively in Rust and it went really fast (he also didn’t have to contend with the meta-interpretation overhead which is probably considerable).

The problem is I don’t know how to parallelise a Prolog meta-interpreter, inductive or not. My inductive meta-interpreter has the same structure as a vanilla Prolog meta-interpreter, so it should be possible to create a new thread for every branch of the SLD-refutation proof (when you pop a new goal off the “stack”), but some of the literature on parallel Prolog execution I’m aware of is discouraging, so I haven’t tried it yet. I think it’s time I gave it a try.
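
For reference, the vanilla meta-interpreter structure I mean is just this; the clause/2 call is where the alternative branches of the SLD proof open up:

% The textbook vanilla Prolog meta-interpreter.
prove(true) :- !.
prove((A, B)) :- !,
    prove(A),            % select the leftmost goal of the conjunction
    prove(B).
prove(Goal) :-
    clause(Goal, Body),  % each matching clause is one branch of the proof
    prove(Body).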

Btw, after I introduced multithreading I noticed that I run out of table space much less often. It seems the multithreading primitives improve RAM usage too. I didn’t expect that, but it’s cool. Tbh, though, I’m doing everything I can to reduce my system’s need for RAM - the tables keep blowing up.

Also, I should note that so far the improvements I’ve seen with multi-threading are mainly in RAM usage, as I say above, and in CPU usage (kind of obviously), but I haven’t seen any speedups. There’s even a bit of overhead, both in inferences and in execution time, though not a great deal.

It’s possible the parts of the application I’m using multithreading with are just not the kind that can benefit the most from it. As I say above I’m still finding my way around it.

In ILP (as I understand it), the main bottleneck is evaluating all positive and negative examples on a newly proposed program, no? You can split the set of examples into N bins and create a thread for each bin. You could use concurrent_forall/2 to hand the examples to a pool of threads.
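
A sketch of that, assuming each example can be checked by calling a single goal (verify_example/1 is a made-up name): concurrent_forall/3 hands each bin to a pool of N threads, so every thread gets a sizeable chunk of work rather than a single example.

:- use_module(library(thread)).

verify_examples(Es, N) :-
    length(Es, L),
    ChunkSize is max(1, L // N),
    bin_examples(Es, ChunkSize, Bins),
    concurrent_forall(member(Bin, Bins),
                      forall(member(E, Bin), verify_example(E)),
                      [threads(N)]).

% Split a list into bins of (at most) Size elements.
bin_examples([], _, []).
bin_examples(Es, Size, [Bin|Bins]) :-
    length(Bin, Size),
    append(Bin, Rest, Es),
    !,
    bin_examples(Rest, Size, Bins).
bin_examples(Es, _, [Es]) :-
    Es \== [].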

Alternatively, most ILP systems do some form of random generation and hill climbing AFAIK. You could have multiple threads doing that, optionally with some mechanism that terminates poorly performing threads early.

The overall idea for Prolog threads is to give them a sizeable job. Creating a thread, feeding it with data and retrieving the results are each one to two orders of magnitude slower than local computation.

Pretty unrelated, but I guess we could try to use LLMs for proposing and repairing programs? Surely this has been tried …


Careful, because “ILP” has changed a lot in the last ten years. I like to say there’s now a New Wave of ILP - and it differs drastically from the past. For example, modern systems don’t, as a general rule, do hill climbing anymore (more precisely, earlier systems would often optimise “coverage”, the ratio of positive and negative examples entailed by a candidate program). I’ve had long conversations where my interlocutor used “ILP” to mean just Aleph, or sometimes the general approach based on bottom clause construction. That’s not what modern ILP does.

As an example, Meta-Interpretive Learning, which I study, is second-order SLD-resolution, i.e. basically Prolog but with second-order definite clauses. No optimisation, no random generation, but we can now do all the things earlier systems couldn’t, like learning recursive programs with arbitrary structure and predicate invention.

It’s exciting stuff, but the ILP community is small and dwindling and I guess most people outside it are not aware of what’s going on inside it.

In the system I’m building verifying a program against negative (but not positive) examples is certainly a bottleneck. I’ve done pretty much what you suggest, with concurrent_forall/2. I originally used my inductive meta-interpreter for verification but I guess that was a dumb idea from the start. Now I’m handing verification to the SWI-Prolog engine, and it goes really fast.

Here’s the relevant bit of code:

verify_program(Cs,W,Es):-
	PM = experiment_file
	,debug_clauses(verify_program_full,'Verifying program:',Cs)
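	% Setup: assert the program's clauses into the experiment file module and table their predicates.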
	,S = (assert_program(PM,Cs,Rs)
	     ,table_untable_predicates(table,PM,Cs)
	     )
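	% Goal: prove every example, concurrently if multithreading is enabled.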
	,(   poker_configuration:multithreading(W)
	->   G = (concurrent_forall(member(E,Es)
				  ,(debug(examples,'Verifying Example: ~w', [E])
				   ,call(PM:E)
				   )
				  )
		 )
	 ;   G = forall(member(E,Es)
		       ,(debug(examples,'Verifying Example: ~w', [E])
			,call(PM:E)
			)
		       )
	 )
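	% Cleanup: erase the program's clauses and untable their predicates.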
	,C = (erase_program_clauses(Rs)
	     ,table_untable_predicates(untable,PM,Cs)
	     )
	,setup_call_cleanup(S,G,C)
	,debug(verify_program,'Verified program accepts all examples',[]).

As you can see it’s still doing tabling and untabling on the fly, as you suggested it shouldn’t. That’s about to change.
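
One possible shape for that change (a sketch, not a tested patch): keep the program’s predicates tabled across runs and have the cleanup only drop their tables, e.g. with abolish_module_tables/1, rather than calling table_untable_predicates/3 with untable. Something like this could stand in for the C term above (cleanup_program/2 is a hypothetical helper; erase_program_clauses/1 is the same predicate used in the snippet):

% Sketch: erase the asserted program's clauses and drop the tables owned by
% the experiment file module; its predicates stay tabled for the next run.
cleanup_program(PM, Rs) :-
    erase_program_clauses(Rs),
    abolish_module_tables(PM).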

I have only just started using this new code and I don’t yet know how much, and what kind of, improvement I get from it, so I need to run some benchmarks. But I have noticed the overhead, and I think I understand there’s a tradeoff.

It has. Generally, people have used LLMs as a firehose spewing out programs willy-nilly, then they try to filter the output with some kind of magickal and mysterious heuristics, statistics, and other -istics. The problem is that a) the LLM needs to propose millions of programs before the filter can catch some good ones, and b) it still doesn’t work as well as traditional program synthesis, which is itself not great.

I think I speak for all programmers when I say I would dearly like to have an automated system to help me program better, except I’d like one that helps me find my bugs and fix them, not one that generates bugs for me to find and fix.


Thanks for the update on ILP. Yes, my knowledge stems from the Aleph days. Interesting to hear that things have changed a lot. I’m a bit surprised by your LLM findings. Is this on abstract data or data that the LLM can link to the world? I recently understood there is some success in guessing database column relations. I assumed this is because the LLM can use statistics about the world (or at least text about the world).

The real issue is not the true positives; those are very easy to spot and report. It gets difficult when you need to tell whether something that looks fine is in fact incorrect. Even defining a correctness measure is impossible for questions to which you don’t yet know the answer.

None of this is surprising, since the goal of LLMs is to mimic human speech, not to reason. They are able to reason in some situations, but this seems to be a side effect of mimicking human speech.

Not sure though.


Just text. LLMs model text, not the world and not anything else. There’s been some work claiming LLMs can learn “world models” but it’s very poorly done. A good summary is on Melanie Mitchell’s blog here:

Although I disagree with her conclusion that LLMs learn some kind of world model. There is no reason why a statistical model of text should learn anything about the world just by modelling the structure of text. If it could, then we would have to conclude there is some relation between the structure of text and the structure of the world - in fact not just “a” relation, but one strong and clear enough that it is possible to learn one structure from the other. That would be a groundbreaking finding indeed, but as far as I’m aware nobody is claiming anything like that. People are instead just waving their hands at vague statements like “language is a world model”, which have no real meaning.

Even if that assumption were true, LLMs would still be completely the wrong kind of machine learning model to guess world structure from text structure. You’d need an approach that can learn “hidden” variables, and the auto-regressive Transformers used to train LLMs can’t do that.

About your earlier question on coding and program repair, this is the kind of work I was referring to:

https://arxiv.org/pdf/2203.07814

This is from DeepMind, and you’ll see they make a very big claim of “competition-level code generation” (the title of the paper is “Competition-Level Code Generation with AlphaCode”; I’m linking to the tech report on arXiv, but the same text, with some changes, was published in Science. I think the Science piece is missing some appendices, so I prefer this one).

The approach in the paper is the standard code-firehose-with-a-magickal-filter that I describe in my comment above.

While the paper (and a DeepMind blog post about it) trumpets its success against human competitors in a coding competition, claiming that “AlphaCode achieved on average a ranking of top 54.3% in competitions with more than 5,000 participants”, the “competition” was in truth based on the results of past competitions, so AlphaCode did not get to compete directly with humans in a new competition.

Besides which, the discerning computer scientist might want to look at Table 10 (page 15 of the PDF), which lists the results against the held-out test sets.

Those results are considerably lower than the trumpeted “54.3%”. They’re also much more illuminating than any fluffy and vague claims about besting some human coders here and there.

Note, btw, that the “10@100k” measure in the final column of that table means that AlphaCode was asked to generate 100,000 answers, 10 of those were automagically selected for submission, and the correct answer (to a coding question) was among those 10 submissions.

In other words, if you have a coding question and you want a system like AlphaCode to solve it for you, you can just ask the model to make 100k guesses and you know that the right answer is somewhere in there about 30% of the time.

Results on basically every other task where LLMs have been tested (like guessing database column relations) are going to be along the same lines: overhyped, but poor when you look closer.

Sorry to disappoint you. Claims about LLMs are mainly hot air. LLMs could be really useful if people tried to maximise their potential as language models and stopped trying to make them into something they aren’t, like program synthesisers, say.


There’s a good overview in the following paper, which goes over the new work and connects it to the earlier approaches:

In the interest of full disclosure, note that I’m linking the version of the paper where Stephen Muggleton is a co-author. Stephen was my thesis advisor during my PhD, so he kindly cited my own work, albeit briefly (reasonably so, since it was still in its early stages). There’s a different version of the same paper by the other two authors that skips over my work completely, so I didn’t link to it. Politics, eh? :)

If you want to see where my work on Meta-Interpretive Learning is at nowadays, it’s all here:
