I’ve been thinking about the ongoing conversation, and I wanted to clarify something. Are you, Jan, suggesting that because LLMs have turned out to be such a good idea, maybe autoencoders (or a similar approach) could also be surprisingly effective in ways we haven’t fully explored yet?
From my perspective, LLMs work well because they index statistical relationships in massive token datasets, allowing for efficient next-token prediction. That doesn’t necessarily mean that autoencoders—or other compression-based methods—would benefit from the same kind of scaling and brute-force approach.
Are you thinking that autoencoders could play a bigger role in tasks like language modeling, retrieval, or structured representation learning? Or are you just drawing a general parallel between their unexpected effectiveness?
Would love to hear more about what you’re thinking!
The problem I am trying to address was already addressed here:
ILP and Reasoning by Analogy
Intuitively, the idea is to use what is already known to explain
new observations that appear similar to old knowledge. In a sense,
it is opposite of induction, where to explain the observations one
comes up with new hypotheses/theories.
Vesna Poprcova et al. - 2010 https://www.researchgate.net/publication/220141214
The problem is that ILP doesn't try to learn and apply analogies, whereas
autoencoders and transformers typically try to "grok" analogies, so that
they can perform well in certain domains with less training data. They
also do some inference on the encoder side for unseen input data, and
some generation on the decoder side for unseen latent-space configurations
produced by unseen input data. By unseen data I mean data not in the
training set. The full context window may tune the inference and
generation, which appeals to:
Analogy as a Search Procedure
Rumelhart and Abrahamson showed that when presented with
analogy problems like monkey:pig::gorilla:X, with rabbit, tiger, cow,
and elephant as alternatives for X, subjects rank the four options
following the parallelogram rule.
Matías Osta-Vélez - 2022 https://www.researchgate.net/publication/363700634
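To make the parallelogram rule concrete, here is a minimal NumPy sketch; the 2-D embedding vectors are invented purely for illustration. The rule predicts X as the alternative closest to gorilla + (pig − monkey), the fourth corner of the parallelogram.

```python
import numpy as np

# Toy 2-D "semantic" vectors, invented purely for illustration.
embed = {
    "monkey":   np.array([0.9, 0.7]),
    "pig":      np.array([0.2, 0.1]),
    "gorilla":  np.array([1.0, 1.0]),
    "rabbit":   np.array([0.3, 0.2]),
    "tiger":    np.array([0.8, 0.9]),
    "cow":      np.array([0.1, 0.4]),
    "elephant": np.array([0.4, 0.5]),
}

# Parallelogram rule: monkey is to pig as gorilla is to X,
# so X should lie near gorilla + (pig - monkey).
target = embed["gorilla"] + (embed["pig"] - embed["monkey"])

alternatives = ["rabbit", "tiger", "cow", "elephant"]
ranked = sorted(alternatives, key=lambda w: float(np.linalg.norm(embed[w] - target)))
print(ranked)   # the four options ordered by distance to the predicted point
```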
There are learning methods that work similarly to ILP, in that they
are based on positive and negative samples, and the statistics can
involve bilinear forms.
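As a toy illustration of what such a bilinear statistic could look like (my own construction, not taken from the cited papers): score a pair of vectors x, y as xᵀWy and fit W from positive and negative pairs with a logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # embedding dimension
W = rng.normal(scale=0.1, size=(d, d))    # the bilinear form to be learned

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))   # clip for numerical stability

# Toy data: positive pairs are related (y is a noisy copy of x),
# negative pairs are unrelated random vectors.
samples = []
for _ in range(50):
    x = rng.normal(size=d)
    samples.append((x, x + 0.1 * rng.normal(size=d), 1.0))   # positive sample
    samples.append((x, rng.normal(size=d), 0.0))             # negative sample

lr = 0.05
for _ in range(200):
    for x, y, label in samples:
        p = sigmoid(x @ W @ y)            # bilinear score squashed to a probability
        # Gradient of the logistic loss w.r.t. W is (p - label) * outer(x, y).
        W -= lr * (p - label) * np.outer(x, y)

# After training, related pairs should tend to score higher than unrelated ones.
x = rng.normal(size=d)
print(sigmoid(x @ W @ (x + 0.1 * rng.normal(size=d))), sigmoid(x @ W @ rng.normal(size=d)))
```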
I just wanted to clarify, @j4n_bur53 (in the context of a message that was deleted), that I'm not angry at you, or at anyone in this conversation. I like to disagree robustly. I hope we can do this without having a fight.
Or I guess it's the fault of the Mediterranean temperament. I also tend to wave my hands wildly and jump up and down when I really get into my stride. I kind of understand why this comes across as if I'm having some sort of crisis, but if you're worried, just watch the whites of my eyes: if they start rolling into my head, that's when it starts to get dangerous.
The problem with connectionism is that it's not
really new, and for those who have already used it
for decades it's also not a paradigm change.
I do not count myself as somebody well versed
in connectionism; I have only a basic artificial intelligence
literacy, which includes taking connectionism
as a paradigm inside artificial intelligence.
That literacy of mine has become extremely rusty
and dusty. Forty years ago it was just part of the usual
curriculum for every computer science student; everybody
got an introduction to connectionism back then, and
there were, for example, financial institutions that used neural
networks for trading. So it's not astonishing that part of modern
neural networks was invented in Lugano, Switzerland,
which has a more Mediterranean climate than, for example, Zurich.
But the application was rather speech:
Deep Learning in Neural Networks: An Overview
Jürgen Schmidhuber - 2014 - The Swiss AI Lab IDSIA
The present survey, however, will focus on the narrower, but
now commercially important, subfield of Deep Learning (DL)
in Artificial Neural Networks (NNs) https://arxiv.org/abs/1404.7828
The above stops at 2013, but it already covers LSTM.
It doesn't help me to implement anything. Try this one; it has a
little Java code. But it is a slightly ancient technology, using the sigmoid
activation function, and it seems to me it uses some graph data structure:
Translating the Java code from the Pfeiffer paper into Prolog, recast as linear
algebra with vectors and matrices, I now have a little piece of pure
Prolog code, which also runs in the browser, that can already learn an
AND, and it uses the ReLU activation function, i.e. no longer the FANN_SIGMOID
activation function. I simulated the bias by an extra input neuron
which is always 1, because I was too lazy to put a bias in the model:
I only use a simple update of the network via μ·Δwij, and the above is
from only 1000 iterations. So no momentum-based method
or other advanced gradient search is implemented yet:
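For readers without the Prolog snippet, here is a rough NumPy equivalent of what the last two paragraphs describe (a sketch, not the actual Prolog code): ReLU in the hidden layer, the bias simulated by a constant-1 input column, and the plain w := w − μ·∂E/∂w update over 1000 iterations; the output unit is left linear in this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# AND truth table; the third input column is the constant 1 that stands in for the bias.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
T = np.array([[0.0], [0.0], [0.0], [1.0]])

relu = lambda z: np.maximum(z, 0.0)

W1 = rng.normal(scale=0.5, size=(3, 4))   # input (incl. bias unit) -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output (kept linear here)

mu = 0.1                                  # plain learning rate, no momentum
for _ in range(1000):
    # Forward pass.
    Z1 = X @ W1
    H = relu(Z1)
    Y = H @ W2
    E = Y - T

    # Backward pass: the ReLU derivative is the 0/1 step on its pre-activation.
    dH = (E @ W2.T) * (Z1 > 0)

    # Plain update w := w - mu * dE/dw, no momentum or other acceleration.
    W2 -= mu * H.T @ E
    W1 -= mu * X.T @ dH

print(np.round(relu(X @ W1) @ W2, 2))     # typically close to [0, 0, 0, 1]
```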
In the Prolog code, the first network parameter is the forward-evaluated network, and the
second network parameter is the backward-propagated error network.
Libraries such as PyTorch cooperate with optimizer libraries
that provide a variety of gradient search methods. One needs
to study how these libraries are architected so that they provide
plug and play. Maybe the same architecture can be brought to Prolog:
A Gentle Introduction to torch.autograd
Next, we load an optimizer, in this case SGD with a
learning rate of 0.01 and momentum of 0.9. We register all
the parameters of the model in the optimizer.
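The corresponding PyTorch pattern, sketched here with a small stand-in model (the tutorial itself uses a pretrained resnet18):

```python
import torch
from torch import nn

# Small stand-in model; the tutorial uses torchvision.models.resnet18 instead.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

# Register all model parameters in the optimizer: SGD with lr 0.01 and momentum 0.9.
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x = torch.randn(4, 10)            # dummy batch
target = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), target)
optim.zero_grad()                 # clear gradients from any previous step
loss.backward()                   # autograd fills .grad on every registered parameter
optim.step()                      # the optimizer applies the SGD-with-momentum update
```

The plug-and-play part is exactly this contract: the model exposes `parameters()`, autograd fills their `.grad` fields, and any optimizer honouring that interface can be swapped in without touching the model or the backward pass.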