*magnum opus*of sorts. I.e. it took me a long time and a lot of effort to get it done. This was mostly because I only worked on it while sitting in stuffy rooms listening to lectures of variable capacity to bore. It's a pretty ambitious project for a gritty script as it is, but having to re-learn all the stuff every time I picked at it did not help either.

The basic goal was just to play around implementing other people's algorithms, but when I did I figured I'd make a little library that I could use for some future project, maybe. So I tried to clean up the code, add comments, maybe do a little testing, etc. We'll see how it works out in the end though.

Edit distances are quite a long way outside anything I have contact with on a daily basis (so sorry for any factual errors that may appear) but they do appear from time to time. I think the first and foremost use I ever had for them was when I was making a language drilling/testing application and had to figure out how to compare the user's answer with the correct one. Just comparing strings seemed pretty crude, so I looked around and discovered this idea that the difference between two strings could be defined as the number of edits that one'd need to make to turn one of the strings into the other - the edit distance. And there were several ways to do that already, which was handy.

Here's a little brief on each of the algorithms and the actual code's way down below.

## Levenshtein Distance

Probably the most popular and well-known type of edit distance. You measure it by finding out how many changes (edit operations) to one of the strings you need to make in order to get the other string, and that number is the edit distance between those two strings. In computing Levenshtein distance there are three types edit changes: deletion, addition, and substitution.

*Example:*

- The string "cat" requires a single addition to become "cats", so the Levenshtein distance between "cat" and "cats" is 1.
- The string "loose" requires one deletion to become "lose", so the distance between those two is also 1.
- The string "shape" requires two substitutions to become "scope" (substitute "a" with "o" and "h" with "c") so the distance between them is 2.
- The string "fodo" requires two substitutions to become "food" (substitute "d" with "o" and "o" with "d") so the distance between them is 2.

The algorithm to compute the distance was presented by Robert Wagner and Michael Fischer in a 1974 article called The String-to-String Correction Problem. The algorithm is a dynamic programming one where a two-dimensional distance array each cell at (i,j) representing the number of necessary edit operations between the i-letter-long prefix of the first word and the j-letter-long prefix of the second. The array is generated by checking what kind of operation is required to make the strings the same and their cost is calculated. The bottom-right cost is the final answer.

I am not sure now why I implemented this algorithm twice, but there you go. I was probably confused. One is an almost direct translation from the pseudocode in

**wikipedia**(

**levenshtein_distance**) and the other is implemented from the original article (

**wagner_fischer_distance**).

## Damerau-Levenshtein Distance

This is the extension of the aforementioned Levenshtein edit distance to include the possibility that two characters have been swapped. That is, the Levenshtein distance between the two strings "doom" and "domo" is two, because two substitutions are required to change the first word into the other. But it is argued that the difference between the words "doom" and "domo" slighter (as it is a common misspelling) and that only one change is necessary: to swap the characters "o" and "m". The Levenshtein distance only allows for three basic operations - deletion, insertion, and substitution - and the Damerau-Levenshtein edit distance allows for a forth - transportation.

*Example:*

- The string "fodo" requires one transposition to become "food" (transpose "d" with "o") so the distance between them is 1.

## Hamming Distance

A much simpler and definitely less versatile method of comparing two string of equal length. Basically, the two strings are compared character-by-character and the distance will be equal to the number of positions where the characters differ. If the strings are not of equal length, however, the algorithm simply does nothing (and returns -1 here).

*Example:*

- The strings "cat" and "cats" cannot be compared at all.
- The strings "this" and "that" will have an edit distance of 2, because "i" does not equal "a" and "s" does not equal "t".
- The strings "this" and "hist" have an edit distance of 4, because characters on all positions differ.

## Jaro and Jaro-Winkler Distance

These edit distances are designed for comparison of short strings (i.e. database fields) especially to detect duplicates and near-duplicates (or whatever you call them). They both return a distance between 0 and 1, which should be treated more like a probability of the strings being the same, since 1 means that strings are identical.

In Jaro distance the number of characters matching on some positions and transpositions are checked. Then three modules are calculated:

- the ratio of the number of matched characters to the length of the first string,
- a similar ratio for the second string,
- the score for matches minus the score for transpositions divided by the score for matches.

The arithmetical average of those three modules is the Jaro distance.

Jaro-Winkler is additionally modified by checking the similarity of prefixes and taking it into account more. That is, the Jaro distance is first counted normally, and then the prefixes are compared - the longer the shared prefix, the more similar the words. Commonality of prefixes is assumed to have a certain weight and so the score for a shared prefix is multiplied by that and then this product is multiplied by one minus the Jaro distance. This second product is then added to the Jaro distance thus creating the Jaro-Winkler distance.

*Example:*

- The Jaro distance between "take" and "takes" is 0.9, because they are nearly similar,
- The Jaro-Distance distance between "take" and "takes" is 1 (for prefix weight of 0.1 and max prefix of 4), because they are nearly similar, but share a common prefix which adds to their similarity.

## The code

1 #!/usr/bin/python3 2 # 3 # Edit distance 4 # 5 # Various methods to derive the edit difference between strings (how many 6 # operations are necessary to make the two strings identical). 7 # 8 # The algorithms themselves were ripped off of Wikipedia and other places: 9 # * http://en.wikipedia.org/wiki/Levenshtein_distance 10 # * http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance 11 # * http://en.wikipedia.org/wiki/Hamming_distance 12 # * http://en.wikipedia.org/wiki/Jaro-Winkler_distance and 13 # * http://lingpipe-blog.com/2006/12/13/code-spelunking-jaro-winkler-string-comparison/ 14 # * http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm 15 # 16 # Requires: 17 # Python 3 18 # 19 # Author: 20 # Konrad Siek <konrad.siek@gmail.com> 21 # 22 # License information: 23 # Copyright 2011 Konrad Siek 24 # 25 # This program is free software: you can redistribute it and/or modify 26 # it under the terms of the GNU General Public License as published by 27 # the Free Software Foundation, either version 3 of the License, or 28 # (at your option) any later version. 29 # 30 # This program is distributed in the hope that it will be useful, 31 # but WITHOUT ANY WARRANTY; without even the implied warranty of 32 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 33 # GNU General Public License for more details. 34 # 35 # You should have received a copy of the GNU General Public License 36 # along with this program. If not, see <http://www.gnu.org/licenses/>. 37 38 def __pp_arr_2d(array): 39 """ Print a 2D array.""" 40 from sys import stdout 41 for row in array: 42 stdout.write("%s\n" % row) 43 stdout.write("\n") 44 45 def __make_arr_2d(value, height, width): 46 """ Create a 2d array of given size filled with given value.""" 47 arr = [] 48 for i in range(0, height): 49 arr.append([None] * width) 50 return arr 51 52 def hamming_distance(string_a, string_b, parameters={}): 53 """ Establish the Hamming edit distance between strings. 54 55 The Hamming distance between two strings is the number of substitutions 56 required to make the first string identical to the other one. 57 58 The Hamming distance requires the strings to be of same length. If they 59 are not it returns -1. 60 61 The Hamming distance function requires no additional parameters. 62 """ 63 64 len_a, len_b = len(string_a), len(string_b) 65 66 # Length needs to be the same for both strings. 67 if len_a != len_b: 68 return -1 69 70 # Check for differences. 71 changes = 0 72 for i in range(0, len_a): 73 if string_a[i] != string_b[i]: 74 changes += 1 75 76 return changes 77 78 79 def levenshtein_distance(string_a, string_b, parameters={}): 80 """ Establish the Levenshtein edit distance between strings. 81 82 The distance between identical strings is 0, and for each basic 83 difference between them (any change that needs to be applied to make 84 them identical) the distance grows by 1. 85 86 Basic operations include: insertion, deletion, and substitution. 87 88 The optional parameter 'cost' is the cost of the substitution operation 89 (1 by default). (Parameter name: 'cost'.) 90 """ 91 92 len_a, len_b = len(string_a), len(string_b) 93 94 cost = parameters['cost'] if 'cost' in parameters else 1 95 96 # Initialize array and fill it with zeros. 97 d = [] 98 for i in range(0, len_a + 1): 99 section = [] 100 d.append(section) 101 for j in range(0, len_b + 1): 102 section.append(0) 103 104 # Maximum distances. 105 for i in range(1, len_a + 1): 106 d[i][0] = i 107 for j in range(1, len_b + 1): 108 d[0][j] = j 109 110 # Dynamic programming FTW. 111 for i in range(1, len_a + 1): 112 for j in range(1, len_b + 1): 113 d[i][j] = d[i-1][j-1] \ 114 if string_a[i-1] == string_b[j-1] \ 115 else min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost) 116 117 return d[len_a][len_b] 118 119 120 def damerau_levenshtein_distance(string_a, string_b, parameters={}): 121 """ Establish the Damerau-Levenshtein edit distance between strings. 122 123 The distance between identical strings is 0, and for each basic 124 difference between them (any change that needs to be applied to make 125 them identical) the distance grows by 1. 126 127 Basic operations include: insertion, deletion, substitution, and 128 transportation of two characters. 129 130 The Damerau-Levenshtein distance differs from the basic Levenshtein 131 distance in that it allows for the transportation operation. 132 133 The optional parameter 'cost' is the cost of the substitution operation 134 (1 by default). (Parameter name: 'cost'.) 135 """ 136 137 len_a, len_b = len(string_a), len(string_b) 138 139 cost = parameters['cost'] if 'cost' in parameters else 1 140 141 # Initialize array and fill it with zeros. 142 d = [] 143 for i in range(0, len_a + 1): 144 section = [] 145 d.append(section) 146 for j in range(0, len_b + 1): 147 section.append(0) 148 149 # Maximum distances. 150 for i in range(1, len_a + 1): 151 d[i][0] = i 152 for j in range(1, len_b + 1): 153 d[0][j] = j 154 155 # Dynamic programming FTW. 156 for i in range(1, len_a + 1): 157 for j in range(1, len_b + 1): 158 d[i][j] = d[i-1][j-1] \ 159 if string_a[i-1] == string_b[j-1] \ 160 else min(d[i-1][j] + 1, d[i][j-1], d[i-1][j-1] + cost) 161 # Check for transposition. 162 if i > 1 and j > 1 \ 163 and string_a[i-1] == string_b[j-2] \ 164 and string_a[i-2] == string_b[j-1]: 165 d[i][j] = min(d[i][j], d[i-2][j-2] + cost) 166 167 return d[len_a][len_b] 168 169 170 def jaro_distance(string_a, string_b, parameters={}): 171 """ Establish the Jaro-Winkler edit distance between strings. 172 173 The Jaro distance is useful for duplicate detection and it is aimed at short 174 strings. It returns a score between 0 and 1 where a higher score signifies 175 more similar strings. 176 177 The Jaro distance function requires no additional parameters. 178 """ 179 180 # The algotihm consists of two basic parts: matching and transpositions: 181 # each of these is extracted into a sub-function. 182 def __matching_chars(string_a, string_b): 183 """ Align characters of one string against those of the other 184 string. """ 185 186 length_a, length_b = len(string_a), len(string_b) 187 188 max_dist = max(length_a, length_b) / 2 - 1 189 190 matching_a = [ False for i in range(0, length_a) ] 191 matching_b = [ False for i in range(0, length_b) ] 192 193 for i in range(0, length_a): 194 for j in range(0, length_b): 195 if string_a[i] == string_b[j]: 196 if i - j == 0 or abs(i - j) < max_dist: 197 matching_a[i] = True 198 matching_b[j] = True 199 200 return matching_a, matching_b 201 202 def __transpositions(string_a, string_b, matching_a, matching_b): 203 """ Extract subsequences of alike characters and count those that 204 cannot be alined as requiring transpositions. """ 205 206 def __make_sequence(string, matching): 207 """ Extract a subsequence where both strings match. """ 208 209 sequence = [] 210 for i in range(0, len(string)): 211 if matching[i]: 212 sequence.append(string[i]) 213 return sequence 214 215 # Find subsequences. 216 sequence_a = __make_sequence(string_a, matching_a) 217 sequence_b = __make_sequence(string_b, matching_b) 218 219 half_transpositions = 0 220 221 # Count half-transpositions. 222 for i in range(0, min(len(sequence_a), len(sequence_b))): 223 #print(len(sequence_a), len(sequence_b), i) 224 if sequence_a[i] != sequence_b[i]: 225 half_transpositions += 1 226 half_transpositions += abs(len(sequence_a) - len(sequence_b)) 227 228 return half_transpositions / 2 229 230 matching_a, matching_b = __matching_chars(string_a, string_b) 231 232 m = float(matching_a.count(True)) 233 t = float(__transpositions(string_a, string_b, matching_a, matching_b)) 234 235 len_a = float(len(string_a)) 236 len_b = float(len(string_b)) 237 238 module_a = m / len_a if len_a > 0 else 0 239 module_b = m / len_b if len_b > 0 else 0 240 module_t = (m - t) / m if m > 0 else 0 241 242 return (module_a + module_b + module_t) / 3.0 243 244 245 def jaro_winkler_distance(string_a, string_b, parameters={}): 246 """ Establish the Jaro-Winkler edit distance between strings. 247 248 The Jaro-Winkler distance is useful for duplicate detection and it is aimed 249 at short strings. It returns a score between 0 and 1 where a higher score 250 signifies more similar strings. 251 252 The Jaro-Winkler distance is more favorable to strings which have a common 253 prefix. The length of the prefix is 4. (Parameter name: 'max_prefix'.) 254 255 The weight given to the similarity of prefixes is set by the scaling factor 256 p which typically is set to 0.1 and should not exceed 0.25 because the 257 distance could then be larger than 1. (Parameter name: 'p'.) 258 """ 259 260 # Parameters and defaults. length_b 261 p = parameters['p'] if 'p' in parameters else 0.1 262 max_prefix = parameters['max_prefix'] if 'max_prefix' in parameters else 4 263 264 # Establish the ordinary Jaro distance. 265 dj = jaro_distance(string_a, string_b, parameters) 266 l = 0 267 268 # Account for the prefix - words starting similarly are considered more 269 # similar. 270 for i in range(0, min(len(string_a), len(string_b), max_prefix)): 271 if string_a[i] == string_b[i]: 272 l += 1 273 else: 274 break 275 276 # Adjust the distance with the prefix-related score. 277 return dj + (p * l * (1 - dj)) 278 279 280 def wagner_fischer_distance (string_a, string_b, parameters={}): 281 """ Establish the Levenshtein edit distance between strings using the 282 Wagner-Fischer dynamic programming algorithm. 283 284 Algorithm X in the original article. 285 286 The distance between identical strings is 0, and for each basic 287 difference between them (any change that needs to be applied to make 288 them identical) the distance grows by 1. 289 290 Basic operations include: insertion, deletion, substitution. 291 292 The optional parameter 'cost' is the cost of the substitution operation 293 (1 by default). (Parameter name: 'cost'.) 294 """ 295 removal_cost = 1 296 insertion_cost = 1 297 substitution_cost = 1 if 'cost' not in parameters else parameters['cost'] 298 299 length_a = len(string_a) + 1 300 length_b = len(string_b) + 1 301 302 d = __make_arr_2d(None, height=length_a, width=length_b) 303 304 d[0][0] = 0 305 306 # Fill in the defaults for string A. 307 for i in range(1, length_a): 308 d[i][0] = d[i - 1][0] + removal_cost # of removing string_a[i - 1] 309 310 # Fill in the defaults for string B. 311 for j in range(1, length_b): 312 d[0][j] = d[0][j - 1] + insertion_cost # of inserting string_b[i - 1] 313 314 # The trace cost function (gamma in the original article): it is simplified 315 # to check only whether the two characters are the same or not. No empty 316 # strings are expected though. 317 gamma = lambda a, b: 0 if a == b else substitution_cost 318 319 # Fill in the rest of the array. 320 for i in range(1, length_a): 321 for j in range(1, length_b): 322 m1 = d[i - 1][j - 1] + gamma(string_a[i - 1], string_b[j - 1]) 323 m2 = d[i - 1][j] + removal_cost # of removing string_a[i - 1] 324 m3 = d[i][j - 1] + insertion_cost # of inserting string_b[j - 1] 325 326 d[i][j] = min(m1, m2, m3) 327 328 return d[length_a - 1][length_b - 1] 329 330 # All the functions and their names for convenience. 331 API = { 332 'Levenshtein': levenshtein_distance, 333 'Damerau-Levenshtein': damerau_levenshtein_distance, 334 'Hamming': hamming_distance, 335 'Jaro': jaro_distance, 336 'Jaro-Winkler': jaro_winkler_distance, 337 'Wagner-Fischer': wagner_fischer_distance, 338 } 339 340 # A demonstration. 341 if __name__ == '__main__': 342 from sys import argv, stdout 343 344 if len(argv) < 3 or len(argv) % 2 < 1: 345 stdout.write("Usage: %s WORD_A WORD_B [ WORD_A WORD_B [...] ]\n" \ 346 % argv[0]) 347 exit(-1) 348 349 for i in range(1, int(len(argv)/2) + 1): 350 351 word_a = argv[2*i - 1] 352 word_b = argv[2*i] 353 sentence_a, sentence_b = [s.split() for s in [word_a, word_b]] 354 355 title = 'distance between "%s" and "%s"' % (word_a, word_b) 356 maxlen = max([len(i) for i in API.keys()] + [len(title)]) 357 358 row = '| %' + str(maxlen) + 's | %5.1f | %5.1f |\n' 359 strrow = row.replace('.1f', 's') 360 edge = '%s\n' % ('-' * (29 + maxlen)) 361 rule = strrow.replace('|', '+').replace(' ', '-') % \ 362 (maxlen * '-', 5 * '-', 5 * '-') 363 364 stdout.write(rule) 365 stdout.write(row.replace('.1f', 's') % (title, 'words', 'lists')) 366 stdout.write(rule) 367 for method in API: 368 word_r = float(API[method](word_a, word_b)) 369 sentence_r = float(API[method](sentence_a, sentence_b)) 370 stdout.write(row % (method, word_r, sentence_r)) 371 stdout.write(rule) 372 stdout.write("\n") 373

The code is available on GitHub at python/editdist.py.

## No comments:

Post a Comment