How fast is the indexer?

I want to give application programmers using pack(identity) the ability to prevent users from choosing obscene usernames.
I have a list of 400 words, but can imagine it growing into the few thousand range.
I want to quickly determine if one of the naughty words is a substring of the user input. So ‘jackisadoofus’ is not OK, because doofus is on my list.

Yes, there are lots of OTHER issues with doing this - I’m aware. My question is just an implementation one.

So, one way would be some sort of prefix tree.

But another would be to simply assert all the bad words as facts, then get all the substrings of length up to the max length of a naughty word, and try them against the list.
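A minimal sketch of that fact-based approach, assuming a `bad_word/1` table (the predicate names and sample words here are my own):

```prolog
% Assumed database of bad words, asserted or loaded at startup.
bad_word(doofus).
bad_word(jerk).

% max_bad_len(-Max): length of the longest bad word,
% the upper bound for substrings worth checking.
max_bad_len(Max) :-
    aggregate_all(max(L),
                  ( bad_word(W), atom_length(W, L) ),
                  Max).

% contains_bad_word(+Input): true if some substring of Input,
% no longer than the longest bad word, is itself a bad word.
% Note: sub_atom/5 with Sub unbound creates an atom per substring.
contains_bad_word(Input) :-
    max_bad_len(Max),
    sub_atom(Input, _, Len, _, Sub),
    Len =< Max,
    bad_word(Sub),
    !.
```

With the sample table, `contains_bad_word(jackisadoofus)` succeeds because `doofus` is among the generated substrings, and the `bad_word/1` lookup benefits from first-argument indexing.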

It might also make sense to assert every bad-word prefix whose length equals that of the shortest bad word, and first check whether the substring of that length at a given position could begin a match.

Anyway, how fast is doing lookup on long lists of this sort of thing?
Should I convert all to atoms?

Indexing won't be the problem when looking up all the substrings. What probably will hurt is the excessive number of atoms this creates for somewhat longer user names. You'll probably be fine if you just go through the bad words and use sub_atom/5 to see whether there is a match. I'd expect not much more than a few milliseconds for that job, and it requires virtually no memory.
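The sub_atom/5 scan suggested above can be sketched like this (again assuming a hypothetical `bad_word/1` table). When the last argument of sub_atom/5 is already bound, the call becomes a substring search and creates no intermediate atoms:

```prolog
% Assumed database of bad words.
bad_word(doofus).
bad_word(jerk).

% has_bad_word(+Input, -Word): Word is a bad word that occurs
% somewhere in Input. Because Word is instantiated before the
% sub_atom/5 call, no new atoms are created during the search.
has_bad_word(Input, Word) :-
    bad_word(Word),
    sub_atom(Input, _, _, _, Word),
    !.
```

For example, `has_bad_word(jackisadoofus, W)` succeeds with `W = doofus`, scanning the word list once.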


Is there a threshold to the number of active atoms in a Prolog app that will cause a noticeable slow down in processing or garbage collection? If so, what is that number?

If I misinterpreted your reply to Annie then my apologies in advance.

There is no limit (on 64-bit systems; on 32-bit the limit is 24M atoms, AFAIK). More atoms do not matter too much. They slow down atom GC a little, as it has to scan all of them to see which are not marked. Doing substring search by generating all possible substrings as atoms seems a bit of a waste of memory, though.
