Note that instead of using :- include(train). it is better to use
:- ensure_loaded(train).
and, before that, run
swipl qlf compile train.pl
That saves nearly a second. Use the same trick for the other large included files.
In general I’m not really surprised. PyClause is fast, but it is dedicated machinery for a single task, written in C++, as opposed to a generic engine. As we have seen, you can also get a speedup of a few times by writing the Prolog code optimally, and some more by exploiting concurrency.