Bug in isub/4?

There seems to be a bug in isub/4:

5 ?- isub(abc,abc,true,S).
S = 1.0.

6 ?- isub(ab,ab,true,S).
S = 0.09999999999999998.

Am I missing something?

EDIT: It seems to happen with short lengths only:

15 ?- numlist(65,80,Cs), between(1,15,N), maplist(char_code,As,Cs), length(L,N), prefix(L,As),
atomic_list_concat(L,A), isub(A,A,true,S), S \= 1.0.
Cs = [65, 66, 67, 68, 69, 70, 71, 72, 73|...],
N = 1,
As = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'|...],
L = ['A'],
A = 'A',
S = 0.04999999999999999 ;
Cs = [65, 66, 67, 68, 69, 70, 71, 72, 73|...],
N = 2,
As = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'|...],
L = ['A', 'B'],
A = 'AB',
S = 0.09999999999999998 ;
false.

EDIT2: Another strange case with length greater than 2 (although it may be a side effect of the short-length bug):

5 ?- isub(hks,jks,true,S).   %  BUG?
S = 0.0.

6 ?- isub(hkst,jkst,true,S).   % OK
S = 0.8346774193548387.

7 ?- isub(hkstw,jkstw,true,S).  % OK
S = 0.8731182795698925.

8 ?- isub(hkstwx,jkstwx,true,S). % OK
S = 0.8974358974358975.

Another failing case:

13 ?- isub(aahks,aajks,true,S). % seems to be a side effect of original short length bug
S = 0.09999999999999998.

14 ?- isub(aaaahks,aaaajks,true,S). % OK
S = 0.7657947686116701.

15 ?- isub(aaahks,aaajks,true,S). % OK
S = 0.6861111111111111.

AFAIK, the isub algorithm is based on the longest common substring. Check the sources (it isn't much code). My contribution is a fairly mechanical translation of the published Java to C. My suspicion is that this is just an artifact of the algorithm.

I found the original paper here:
http://www.image.ece.ntua.gr/papers/378.pdf

Looking at isub.c, the problem seems to be solved by deleting 3 lines:

Replace:

    if (best > 2)
      common += best;
    else
      best = 0;

with

    common += best;

In my quick scan of the paper I didn't see any reason to limit the 'best' substring to those with length greater than 2, but I didn't find the original Java code.
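To make the effect concrete, here is a rough Prolog rendering of my reading of the matching loop (the real code is C; matched_length/3 and its helpers are just illustrative names). The loop repeatedly takes the longest common substring, counts its length, and cuts it out of both atoms; the best > 2 test drops every match of length 2 or less:

    matched_length(A, B, Common) :-
        longest_common(A, B, Best, RestA, RestB),
        (   Best > 2                        % the disputed test
        ->  matched_length(RestA, RestB, Common0),
            Common is Common0 + Best
        ;   Common = 0
        ).

    longest_common(A, B, Best, RestA, RestB) :-
        aggregate_all(max(Len, BA-BB),
                      ( sub_atom(A, BA, Len, _, Sub),
                        sub_atom(B, BB, Len, _, Sub)
                      ),
                      max(Best, BeginA-BeginB)),
        cut_out(A, BeginA, Best, RestA),
        cut_out(B, BeginB, Best, RestB).

    cut_out(Atom, Before, Len, Rest) :-     % delete Len chars at Before
        sub_atom(Atom, 0, Before, _, Prefix),
        After is Before+Len,
        sub_atom(Atom, After, _, 0, Suffix),
        atom_concat(Prefix, Suffix, Rest).

With this reading, hks and jks share only 'ks' (length 2), so the whole commonality collapses to zero, matching the S = 0.0 above:

    ?- matched_length(hks, jks, C).
    C = 0.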

With the above change, the bug seems solved and things work fine in my limited tests.

I am not able to submit a GitHub PR under my current circumstances, but I hope this solves the issue.

EDIT: Perhaps the original author did want to limit 'best' substrings to those with length greater than 2, because the algorithm is meant to be used with ontologies, in which substrings tend to be meaningful words, usually more than two characters long. If that is the case, and you decide to keep it as it is, the documentation should probably say that this algorithm fails when comparing texts whose differences/similarities are two characters or fewer.


I'm a little reluctant about that. Surely I didn't invent the best > 2 myself. I guess the developers added this condition for a reason, probably after publishing the paper. So, most likely you gain some and you lose some :slight_smile:

I have some vague memory of a project where we used dwim_match/3 for really short atoms and isub/4 for longer ones.
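Something like this combination, perhaps (a hypothetical sketch; the predicate name and the length threshold are made up):

    %   Use dwim_match/2 for really short atoms, isub/4 otherwise.
    similarity(A, B, S) :-
        atom_length(A, Len),
        (   Len =< 3
        ->  (   dwim_match(A, B)
            ->  S = 1.0
            ;   S = 0.0
            )
        ;   isub(A, B, true, S)
        ).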

This is rather tricky stuff. isub/4 is a simple, crude algorithm that works fairly well for a particular class of terms. I think it was developed in a project about ontology matching. Most ontology terms are somewhat more "complicated" words and therefore usually not very short. This style of matching also differs from finding typos (where you should concentrate on typical keyboard errors and sound-alikes). What is your use case?


What do you think about adding an option to allow short strings? The algorithm would remain the same, but adding the option short would enable processing of short strings.

I find the isub algorithm extremely good, as it takes into account both similarities and differences in order to calculate the distance between two strings, and it does the calculation in place.

I definitely need short strings with a numeric distance as output, since I am comparing against a list of possibilities and then sorting the results by distance. dwim_match/3 will not work for me because I need numeric output to sort the results.
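For illustration, a minimal sketch of what I am doing (best_matches/3 is just a name for this example):

    :- use_module(library(pairs)).

    %   Rank Candidates against Input, best first, on the isub score.
    best_matches(Input, Candidates, Ranked) :-
        map_list_to_pairs(score(Input), Candidates, Pairs),
        sort(1, @>=, Pairs, Ranked).    % Score-Candidate pairs, descending

    score(Input, Candidate, Score) :-
        isub(Input, Candidate, true, Score).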

So the code will be something like:

    if (!allow_short) {  // Original algorithm used normally
                         // ("short" is a C keyword, hence allow_short)
        if (best > 2)
          common += best;
        else
          best = 0;
    } else {             // Allow short strings
        common += best;
    }

I can live with that. Originally, case normalization was always performed. I've made that optional too. Will you make a PR?

In my circumstances right now I can't make a PR; would you take a diff posted to this thread?

Preferably using git and git format-patch. A plain diff is the last resort.

Sure. What API would you prefer on the Prolog side?

  1. Keep isub/4 as it is, and add isub_short/4, or
  2. Change the Normalize argument in isub/4 to allow for:
    • true → normalize, compatible with older code
    • false → do not normalize, also compatible
    • short_normalize → short with normalization
    • short → short without normalization

To be honest, I don't know. Neither option is very nice. Ideally we'd use an option list, but that slows things down significantly. One option might be to use an option list and add goal_expansion/2 to call the right thing, in which case it doesn't matter too much what the ugly thing looks like.

Do as you think is best (as long as it remains compatible of course).


Yes, I agree.

In terms of semantics, from the algorithmic point of view I think it is better to add isub_short/4, because one of the properties of the algorithm for use in the semantic web is to produce lower scores for things that are very different in a human language: e.g., score and store produce a distance of 0.45 according to the paper (the SWI-Prolog implementation produces 0.72), whereas the short algorithm produces 0.88. That makes me think it is better to add isub_short/4 to make this difference of intent clear. Do you agree?


Thanks, here is the git diff patch, including documentation. I had to use the .log extension since the upload was not allowed otherwise. I had some trouble with git format-patch, so I used git diff.

isub_short.log (5.7 KB)

Thanks. Merged. I also found the Java version that was apparently lost on some old machine (not checked in, I guess). Added to the repo.


There must be some bug in the C implementation; the Java code returns the following results:

$ java I_Sub          
Test de la mesure ISub

Test: store VS spore
	->Resultat: 0.4530841121495327

Test: numPages VS numberOfPages
	->Resultat: 0.8333333333333333

Test: DosageDuFacteurV VS MesureDuFacteurV
	->Resultat: 0.5670761078998073

Test: DosageDuFacteurV VS DosageDuFacteurX
	->Resultat: 0.9564759036144579

Test: DosageDuFacteurV VS MesureDuFacteurX
	->Resultat: 0.45833333333333337

Test: SyndromeDeKawasaki VS MaladieDeKawasaki
	->Resultat: 0.465527950310559

whereas the C code, called from Prolog, returns the following:

14 ?- L=[store,'numPages','DosageDuFacteurV','DosageDuFacteurV','DosageDuFacteurV','SyndromeDeKawasaki'], L1=[spore,'numberOfPages','MesureDuFacteurV','DosageDuFacteurX','MesureDuFacteurX','MaladieDeKawasaki'], maplist([A,B,R]>>(isub(A,B,true,D),R=[A-B,D]),L,L1,Result),print_term(Result,[]).
[ [store-spore,0.7265420560747664],
  [numPages-numberOfPages,0.9166666666666666],
  ['DosageDuFacteurV'-'MesureDuFacteurV',0.7835380539499037],
  ['DosageDuFacteurV'-'DosageDuFacteurX',0.978237951807229],
  ['DosageDuFacteurV'-'MesureDuFacteurX',0.7291666666666667],
  ['SyndromeDeKawasaki'-'MaladieDeKawasaki',0.7327639751552795]
]

Quite a big difference. Notably, store/spore produces 0.45 in the Java code, as mentioned in the paper, whereas the C code produces 0.72. The same goes for DosageDuFacteurV vs MesureDuFacteurX and SyndromeDeKawasaki vs MaladieDeKawasaki. My guess is that the bug is in the in-place implementation.

I don't know. It is long ago. Note that the I_Sub.java I added to the repo is not the one I originally used for the port. That one is unfortunately lost. I guess I verified the output back then, but I'm not sure. You seem to be able to read both :slight_smile:

:smile: hehe, you're funny. I'll take a look when I get some time (it may be a while, since the way it is now works for my use case).
If we have to go back to code that does not make the comparison in place (i.e., that may need to allocate), would you be OK with that?

From what I recall, I'd be very surprised if that was the problem. Making a copy is not a big deal, although I typically use a local array for "short" strings. Reducing malloc() stress is far less of a problem than it used to be, but it still seems worthwhile in many cases.

Someone (JE? Is that you?) added this change:

    // Modification JE: returned normalization (instead of [-1 1])
    result = commonality - dissimilarity + winklerImprovementVal;

    return (result + 1) / 2;   // <---- This turns the 0.45 into 0.72

I think this does some damage to the original algorithm, which is tuned so that values above 0.6 probably indicate a good match. Normalizing to [0,1] breaks this nice property, since the threshold would now be 0.72, which to my mind means 72% similarity instead of the original 45%. This is because the algorithm measures dissimilarity as well as similarity, so the range [-1,1] makes more sense.
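Indeed, feeding the Java numbers from above through that last line reproduces the C results exactly, so the whole difference appears to be this normalization (computed here in SWI-Prolog):

    ?- X is (0.4530841121495327 + 1)/2.    % store vs spore
    X = 0.7265420560747664.

    ?- X is (0.465527950310559 + 1)/2.     % SyndromeDeKawasaki vs MaladieDeKawasaki
    X = 0.7327639751552795.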

Parametric version

I was thinking that instead of isub_short/4 we should have isub_parametric/4, where the last argument is an Options list that includes an option substring_threshold: the original algorithm corresponds to substring_threshold == 2, whereas isub_short/4 corresponds to substring_threshold == 0. This would allow users to tweak the value depending on the application. The Options would also include normalize=Boolean, and would leave room for future improvements.
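For illustration, calls to the proposed predicate (which does not exist yet) would look like:

    ?- isub_parametric(store, spore, S,
                       [substring_threshold(2), normalize(true)]).  % current isub/4
    ?- isub_parametric(store, spore, S,
                       [substring_threshold(0), normalize(true)]).  % current isub_short/4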

Questions

So, to summarize, I have two questions:

  1. Would you like me to submit patches for isub_parametric/4 as described above, or leave isub_short/4 as it is?
  2. Do you think we should remove the normalization that turns 0.45 into 0.72? I think it is better to match the original paper (leaving the original [-1,1] range). Or should we make that an option of the new isub_parametric/4, leaving the original isub/4 untouched?

Thanks for figuring this out. Good to hear this wasn't a simple typo :slight_smile:

As isub_short/4 has been among us for just a couple of days, we can replace it without affecting anyone too seriously :slight_smile:

I don't know. isub/4 came to us in an ontology-alignment project in which the group that developed isub participated. We never had any discussion on this (JE is not me; if I sign stuff I use JW).

Providing an option list and parsing it in C slows things down a bit too much, I fear. A simple option I see is to pack all parameters into a single int (int64) and have a '$isub'/4. Now the ideal API is probably

isub(Text1, Text2, Similarity, Options)

which we can implement as a runtime mapping that calls '$isub'/4, and provide goal_expansion/2 to do this mapping at compile time if Options is provided and sufficiently instantiated. At least, I've used isub/4 in scenarios where speed really mattered.
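Something along these lines, perhaps (a rough sketch; '$isub'/4 and the flag encoding here are assumptions, not existing code):

    :- use_module(library(option)).

    isub(T1, T2, Sim, Options) :-
        isub_flags(Options, Flags),
        '$isub'(T1, T2, Sim, Flags).

    %   Pack the option list into a single integer for the C side.
    isub_flags(Options, Flags) :-
        option(normalize(Norm), Options, true),
        option(substring_threshold(Th), Options, 2),
        bool_bit(Norm, NBit),
        Flags is NBit \/ (Th << 1).

    bool_bit(false, 0).
    bool_bit(true, 1).

    %   Do the mapping at compile time when Options is known.
    goal_expansion(isub(T1, T2, Sim, Options), '$isub'(T1, T2, Sim, Flags)) :-
        is_list(Options),
        catch(isub_flags(Options, Flags), _, fail).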

Can you deal with that?