László KozmaHome | About me | Projects | Links | Weblog | Ideas string comparison tweaks [more ideas] Edit distance or Levenshtein distance is the de-facto standard in string comparison. It is used in most spell-checkers and similar applications. The algorithm basically counts how many character insertions, deletions or replacements are needed at a minimum to transform the string s1 to s2. The resulting number is the distance between the two strings. In various situations I have used or considered using some tweaks on this algorithm. I'm pretty sure they have been used and documented elsewhere a long time ago: 1. Normalize by length. 'Conceptualize' and 'Conceptualixe' are intuitively more similar than 'zoo' and 'moo'. The trick is to normalize in such a way that the resulting distance will be between 0 and 1, and satisfies some nice mathematical properties, like the triangle-inequality. (the two strings can differ in length). 2. Assign weights to the atomic operations. On a regular keyboard 'q' is more likely to be a mis-typed 'w' than a mis-typed 'm'. A duplication of a letter could be considered less of a difference, then an additional letter from the other end of the keyboard. Therefore, edits at the character-level could be weighted with the distance on the keyboard. This is an example of a keyboard-distance metric. In other applications, for example in handwritten character - or mouse gestures recognition, a shape is encoded as a sequence of directions. For example, a straight vertical line drawn from bottom-up, would be NNNNNNN, where N is 'North'. Another shape could be NEESSWWN. The similarity of the two shapes is often measured using the String edit distance. (Obviously more than 4 directions are used). Here too the edit operations could be weighted, with the angle between two individual directions. Again the trick is to preserve mathematical properties of the distance function, and keep it in a meaningful range. [EDIT: 25.06.2008] 3.Use other operations. Besides insert, delete or replace, other operations are meaningful in various applications. When editing natural language text or a computer program, moving a chunk of text from one place to another doesn't seem to make the text as different as the edit difference would suggest (it is counted as a series of deletions plus a series of insertions). Therefore an additional move operation could be used, which is equivalent with a successive delete and insert of the same character. Unfortunately the algorithm for calculating the traditional edit distance becomes much more complicated in this way. I am sure this has been studied in the literature, but I haven't found very relevant references. Researchers have considered even more general operations, where the word is considered as a linked list, and all operations are valid that take O(1) time. For linked list, besides the previous ones, this includes removing substring of arbitrary length, or moving substring of arbitrary length, as both can be implemented by manipulating pointers. In some applications these operations might be sensible, in others not. The reference paper is: Edit distance with move operations Dana Shapira, James A. Storer Journal of Discrete Algorithms Volume 5 , Issue 2 (June 2007) Note: The Damerau-Levenshtein distance allows single character transpositions besides the other three operations, but I haven't found any reference of single character moves. blog comments powered by Disqus |