I want to give application programmers using pack(identity) the ability to prevent users from creating obscene usernames.
I have a list of 400 words, but I can imagine it growing to a few thousand.
I want to quickly determine if one of the naughty words is a substring of the user input. So ‘jackisadoofus’ is not OK, because doofus is on my list.
Yes, there are lots of OTHER issues with doing this - I’m aware. My question is just an implementation one.
So, one way would be some sort of prefix tree.
But another would be to simply assert all the bad words as facts, then generate every substring of the input up to the maximum bad-word length and look each one up against those facts.
It might also make sense to assert the prefixes of the bad words, down to the length of the shortest bad word, and first check whether the substring at a given position could be the start of a match. A sketch of the basic idea follows.
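Roughly, I'm imagining something like this minimal sketch (the bad_word/1 facts are stand-ins for the real list, and name_contains_bad_word/2 is just a name I made up):

```prolog
:- dynamic bad_word/1.

% Stand-in facts; the real list would be loaded from elsewhere.
bad_word(doofus).
bad_word(jerk).

% Succeeds if any substring of Name, up to MaxLen characters long
% (MaxLen = length of the longest bad word), is on the list.
% Note that every sub_atom/5 call here creates a fresh atom.
name_contains_bad_word(Name, MaxLen) :-
    atom_length(Name, NameLen),
    between(1, MaxLen, Len),
    Last is NameLen - Len,
    between(0, Last, Before),
    sub_atom(Name, Before, Len, _, Sub),
    bad_word(Sub),                 % first-argument indexed lookup
    !.
```

For example, `name_contains_bad_word(jackisadoofus, 6)` should succeed by finding doofus.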
Anyway, how fast is doing lookup on long lists of this sort of thing?
Should I convert all to atoms?
The indexing won’t kill you when looking up all the substrings. What probably will kill you is the excessive number of atoms this creates for somewhat longer user names. You’ll probably be fine if you just iterate over the bad words and use sub_atom/5 to check for a match. I’d expect not much more than a few milliseconds for that job, and it requires virtually no memory.
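For the record, a minimal sketch of what I mean, assuming the same hypothetical bad_word/1 facts as in the question:

```prolog
% Iterate over the bad words and let sub_atom/5 search for each one
% inside the name.  Since both the name and the bad word are already
% atoms, this creates no new atoms at all.
name_matches_bad_word(Name) :-
    bad_word(Bad),
    sub_atom(Name, _, _, _, Bad),
    !.
```

e.g. `name_matches_bad_word(jackisadoofus)` succeeds on the doofus fact.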
Is there a threshold number of active atoms in a Prolog app beyond which processing or garbage collection noticeably slows down? If so, what is that number?
If I misinterpreted your reply to Annie then my apologies in advance.
There are no limits (on 64-bit systems; on 32-bit the limit is 24M atoms, AFAIK). More atoms do not matter too much. They slow down atom GC a little, as it has to scan over all of them to see which ones are not marked. Doing substring search by generating all possible substrings as atoms seems a bit of a waste of memory, though.