Monday, December 5, 2011

Edit distances

This is my magnum opus of sorts - that is, it took me a long time and a lot of effort to get done. This was mostly because I only worked on it while sitting in stuffy rooms listening to lectures of variable capacity to bore. It's a pretty ambitious project for a gritty script as it is, but having to re-learn all the stuff every time I picked at it did not help either.

The basic goal was just to play around implementing other people's algorithms, but when I did I figured I'd make a little library that I could use for some future project, maybe. So I tried to clean up the code, add comments, maybe do a little testing, etc. We'll see how it works out in the end though.

Edit distances are quite a long way outside anything I have contact with on a daily basis (so sorry for any factual errors that may appear), but they do come up from time to time. The first and foremost use I ever had for them was when I was making a language drilling/testing application and had to figure out how to compare the user's answer with the correct one. Just comparing strings for equality seemed pretty crude, so I looked around and discovered the idea that the difference between two strings could be defined as the number of edits one would need to make to turn one of the strings into the other - the edit distance. And there were several ways to do that already, which was handy.

Here's a little brief on each of the algorithms; the actual code is way down below.

Levenshtein Distance


Probably the most popular and well-known type of edit distance. You measure it by finding out how many changes (edit operations) you need to make to one of the strings in order to get the other string; that number is the edit distance between the two strings. In computing the Levenshtein distance there are three types of edit operations: deletion, insertion, and substitution.

Example:
  • The string "cat" requires a single addition to become "cats", so the Levenshtein distance between "cat" and "cats" is 1.
  • The string "loose" requires one deletion to become "lose", so the distance between those two is also 1.
  • The string "shape" requires two substitutions to become "scope" (substitute "a" with "o" and "h" with "c") so the distance between them is 2.
  • The string "fodo" requires two substitutions to become "food" (substitute "d" with "o" and "o" with "d") so the distance between them is 2.

The algorithm to compute the distance was presented by Robert Wagner and Michael Fischer in a 1974 article called The String-to-String Correction Problem. It is a dynamic programming algorithm that fills a two-dimensional distance array in which each cell (i,j) holds the number of edit operations needed to turn the i-letter-long prefix of the first word into the j-letter-long prefix of the second. Each cell is computed from its already-filled neighbors by checking which operation is cheapest at that point, and the bottom-right cell holds the final answer.
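To make the dynamic programming idea concrete, here is a minimal illustrative sketch of the table-filling recurrence (a toy version for this post, separate from the library code further down; `cost` is the substitution cost):

```python
def lev(a, b, cost=1):
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # turn a[:i] into "" by deleting everything
    for j in range(len(b) + 1):
        d[0][j] = j  # turn "" into b[:j] by inserting everything
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match: no new cost
            else:
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

print(lev("shape", "scope"))  # 2
```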

I am not sure now why I implemented this algorithm twice, but there you go. I was probably confused. One version is an almost direct translation of the pseudocode on Wikipedia (levenshtein_distance) and the other is implemented from the original article (wagner_fischer_distance).

Damerau-Levenshtein Distance


This is an extension of the aforementioned Levenshtein edit distance that includes the possibility that two adjacent characters have been swapped. That is, the Levenshtein distance between the two strings "doom" and "domo" is two, because two substitutions are required to change the first word into the other. But it can be argued that the difference between the words "doom" and "domo" is slighter (it is a common kind of misspelling) and that only one change is necessary: swapping the characters "o" and "m". The Levenshtein distance allows only three basic operations - deletion, insertion, and substitution - while the Damerau-Levenshtein edit distance allows a fourth - transposition.

Example:
  • The string "fodo" requires one transposition to become "food" (transpose "d" with "o") so the distance between them is 1.
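The transposition check bolts onto the same dynamic programming table with one extra rule; here is a minimal sketch (of the restricted, "optimal string alignment" flavor, which is also what the code below implements):

```python
def damerau(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
            # The extra rule: two adjacent characters swapped count as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(damerau("fodo", "food"))  # 1
```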

Hamming Distance


A much simpler and definitely less versatile method of comparing two strings of equal length. Basically, the two strings are compared character by character, and the distance is equal to the number of positions where the characters differ. If the strings are not of equal length, however, the distance is undefined (the function here simply returns -1).

Example:
  • The strings "cat" and "cats" cannot be compared at all, because their lengths differ.
  • The strings "this" and "that" will have an edit distance of 2, because "i" does not equal "a" and "s" does not equal "t".
  • The strings "this" and "hist" have an edit distance of 4, because characters on all positions differ.
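Because it is just a position-by-position comparison, the whole thing fits in a few lines (a sketch, keeping the same -1 convention for unequal lengths as the code below):

```python
def hamming(a, b):
    if len(a) != len(b):
        return -1  # lengths differ: the Hamming distance is undefined
    # Count the positions at which the characters differ.
    return sum(x != y for x, y in zip(a, b))

print(hamming("this", "that"))  # 2
print(hamming("cat", "cats"))   # -1
```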

Jaro and Jaro-Winkler Distance


These edit distances are designed for comparing short strings (e.g. database fields), especially to detect duplicates and near-duplicates (or whatever you call them). They both return a score between 0 and 1, which should be treated more like a measure of similarity than a distance, since 1 means that the strings are identical.

In the Jaro distance, the number of characters matching at (roughly) corresponding positions and the number of transpositions are determined first. Then three components are calculated:
  • the ratio of the number of matched characters to the length of the first string,
  • a similar ratio for the second string,
  • the number of matches minus the number of transpositions, divided by the number of matches.

The arithmetic mean of those three components is the Jaro distance.
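As a worked example of those three components (m is the number of matching characters, t the number of transpositions): for "take" and "takes" all four characters of the shorter string match and nothing is transposed, so:

```python
# Worked example for "take" vs "takes" using the formula described above.
m, t = 4.0, 0.0      # matches and transpositions
len_a, len_b = 4, 5  # len("take"), len("takes")

jaro = (m / len_a + m / len_b + (m - t) / m) / 3
print(round(jaro, 4))  # 0.9333
```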

Jaro-Winkler additionally checks the similarity of prefixes and takes it into account: the Jaro distance is computed first, and then the prefixes are compared - the longer the shared prefix, the more similar the words are assumed to be. The length of the shared prefix is multiplied by a weight (the scaling factor), that product is multiplied by one minus the Jaro distance, and the result is added to the Jaro distance, producing the Jaro-Winkler distance.

Example:
  • The Jaro distance between "take" and "takes" is about 0.93, because they are nearly identical,
  • The Jaro-Winkler distance between "take" and "takes" is 0.96 (for a prefix weight of 0.1 and a max prefix of 4), because on top of being nearly identical they share a common prefix, which adds to their similarity.
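Plugging the numbers from the example into the prefix adjustment (assuming a scaling factor p = 0.1 and a shared prefix of length 4):

```python
dj = 2.8 / 3   # the plain Jaro distance of "take" vs "takes" (about 0.9333)
p, l = 0.1, 4  # scaling factor and length of the common prefix
djw = dj + p * l * (1 - dj)
print(round(djw, 2))  # 0.96
```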

The code


#!/usr/bin/python3
#
# Edit distance
#
# Various methods to derive the edit difference between strings (how many
# operations are necessary to make the two strings identical).
#
# The algorithms themselves were ripped off of Wikipedia and other places:
#     * http://en.wikipedia.org/wiki/Levenshtein_distance
#     * http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
#     * http://en.wikipedia.org/wiki/Hamming_distance
#     * http://en.wikipedia.org/wiki/Jaro-Winkler_distance
#     * http://lingpipe-blog.com/2006/12/13/code-spelunking-jaro-winkler-string-comparison/
#     * http://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm
#
# Requires:
#     Python 3
#
# Author:
#     Konrad Siek <konrad.siek@gmail.com>
#
# License information:
#     Copyright 2011 Konrad Siek
#
#     This program is free software: you can redistribute it and/or modify
#     it under the terms of the GNU General Public License as published by
#     the Free Software Foundation, either version 3 of the License, or
#     (at your option) any later version.
#
#     This program is distributed in the hope that it will be useful,
#     but WITHOUT ANY WARRANTY; without even the implied warranty of
#     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#     GNU General Public License for more details.
#
#     You should have received a copy of the GNU General Public License
#     along with this program.  If not, see <http://www.gnu.org/licenses/>.

def __pp_arr_2d(array):
    """Print a 2D array."""
    from sys import stdout
    for row in array:
        stdout.write("%s\n" % row)
    stdout.write("\n")

def __make_arr_2d(value, height, width):
    """Create a 2D array of the given size, filled with the given value."""
    arr = []
    for i in range(0, height):
        arr.append([value] * width)
    return arr

def hamming_distance(string_a, string_b, parameters={}):
    """Establish the Hamming edit distance between strings.

    The Hamming distance between two strings is the number of substitutions
    required to make the first string identical to the other one.

    The Hamming distance requires the strings to be of the same length. If
    they are not, it returns -1.

    The Hamming distance function requires no additional parameters.
    """

    len_a, len_b = len(string_a), len(string_b)

    # Length needs to be the same for both strings.
    if len_a != len_b:
        return -1

    # Check for differences.
    changes = 0
    for i in range(0, len_a):
        if string_a[i] != string_b[i]:
            changes += 1

    return changes


def levenshtein_distance(string_a, string_b, parameters={}):
    """Establish the Levenshtein edit distance between strings.

    The distance between identical strings is 0, and for each basic
    difference between them (any change that needs to be applied to make
    them identical) the distance grows by 1.

    Basic operations include: insertion, deletion, and substitution.

    The optional parameter 'cost' is the cost of the substitution operation
    (1 by default). (Parameter name: 'cost'.)
    """

    len_a, len_b = len(string_a), len(string_b)

    cost = parameters['cost'] if 'cost' in parameters else 1

    # Initialize array and fill it with zeros.
    d = []
    for i in range(0, len_a + 1):
        section = []
        d.append(section)
        for j in range(0, len_b + 1):
            section.append(0)

    # Maximum distances.
    for i in range(1, len_a + 1):
        d[i][0] = i
    for j in range(1, len_b + 1):
        d[0][j] = j

    # Dynamic programming FTW.
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            d[i][j] = d[i-1][j-1] \
                if string_a[i-1] == string_b[j-1] \
                else min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)

    return d[len_a][len_b]


def damerau_levenshtein_distance(string_a, string_b, parameters={}):
    """Establish the Damerau-Levenshtein edit distance between strings.

    The distance between identical strings is 0, and for each basic
    difference between them (any change that needs to be applied to make
    them identical) the distance grows by 1.

    Basic operations include: insertion, deletion, substitution, and
    transposition of two adjacent characters.

    The Damerau-Levenshtein distance differs from the basic Levenshtein
    distance in that it allows for the transposition operation.

    The optional parameter 'cost' is the cost of the substitution operation
    (1 by default). (Parameter name: 'cost'.)
    """

    len_a, len_b = len(string_a), len(string_b)

    cost = parameters['cost'] if 'cost' in parameters else 1

    # Initialize array and fill it with zeros.
    d = []
    for i in range(0, len_a + 1):
        section = []
        d.append(section)
        for j in range(0, len_b + 1):
            section.append(0)

    # Maximum distances.
    for i in range(1, len_a + 1):
        d[i][0] = i
    for j in range(1, len_b + 1):
        d[0][j] = j

    # Dynamic programming FTW.
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            d[i][j] = d[i-1][j-1] \
                if string_a[i-1] == string_b[j-1] \
                else min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
            # Check for transposition.
            if i > 1 and j > 1 \
                and string_a[i-1] == string_b[j-2] \
                and string_a[i-2] == string_b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + cost)

    return d[len_a][len_b]


def jaro_distance(string_a, string_b, parameters={}):
    """Establish the Jaro edit distance between strings.

    The Jaro distance is useful for duplicate detection and it is aimed at
    short strings. It returns a score between 0 and 1 where a higher score
    signifies more similar strings.

    The Jaro distance function requires no additional parameters.
    """

    # The algorithm consists of two basic parts, matching and transpositions:
    # each of these is extracted into a sub-function.
    def __matching_chars(string_a, string_b):
        """Align characters of one string against those of the other
        string."""

        length_a, length_b = len(string_a), len(string_b)

        max_dist = max(length_a, length_b) / 2 - 1

        matching_a = [False for i in range(0, length_a)]
        matching_b = [False for i in range(0, length_b)]

        for i in range(0, length_a):
            for j in range(0, length_b):
                if string_a[i] == string_b[j]:
                    if i - j == 0 or abs(i - j) < max_dist:
                        matching_a[i] = True
                        matching_b[j] = True

        return matching_a, matching_b

    def __transpositions(string_a, string_b, matching_a, matching_b):
        """Extract subsequences of alike characters and count those that
        cannot be aligned as requiring transpositions."""

        def __make_sequence(string, matching):
            """Extract a subsequence where both strings match."""

            sequence = []
            for i in range(0, len(string)):
                if matching[i]:
                    sequence.append(string[i])
            return sequence

        # Find subsequences.
        sequence_a = __make_sequence(string_a, matching_a)
        sequence_b = __make_sequence(string_b, matching_b)

        half_transpositions = 0

        # Count half-transpositions.
        for i in range(0, min(len(sequence_a), len(sequence_b))):
            if sequence_a[i] != sequence_b[i]:
                half_transpositions += 1
        half_transpositions += abs(len(sequence_a) - len(sequence_b))

        return half_transpositions / 2

    matching_a, matching_b = __matching_chars(string_a, string_b)

    m = float(matching_a.count(True))
    t = float(__transpositions(string_a, string_b, matching_a, matching_b))

    len_a = float(len(string_a))
    len_b = float(len(string_b))

    module_a = m / len_a if len_a > 0 else 0
    module_b = m / len_b if len_b > 0 else 0
    module_t = (m - t) / m if m > 0 else 0

    return (module_a + module_b + module_t) / 3.0


def jaro_winkler_distance(string_a, string_b, parameters={}):
    """Establish the Jaro-Winkler edit distance between strings.

    The Jaro-Winkler distance is useful for duplicate detection and it is
    aimed at short strings. It returns a score between 0 and 1 where a
    higher score signifies more similar strings.

    The Jaro-Winkler distance is more favorable to strings which have a
    common prefix. The maximum length of the considered prefix is 4.
    (Parameter name: 'max_prefix'.)

    The weight given to the similarity of prefixes is set by the scaling
    factor p, which is typically set to 0.1 and should not exceed 0.25,
    because the distance could then be larger than 1. (Parameter name: 'p'.)
    """

    # Parameters and defaults.
    p = parameters['p'] if 'p' in parameters else 0.1
    max_prefix = parameters['max_prefix'] if 'max_prefix' in parameters else 4

    # Establish the ordinary Jaro distance.
    dj = jaro_distance(string_a, string_b, parameters)
    l = 0

    # Account for the prefix - words starting similarly are considered more
    # similar.
    for i in range(0, min(len(string_a), len(string_b), max_prefix)):
        if string_a[i] == string_b[i]:
            l += 1
        else:
            break

    # Adjust the distance with the prefix-related score.
    return dj + (p * l * (1 - dj))


def wagner_fischer_distance(string_a, string_b, parameters={}):
    """Establish the Levenshtein edit distance between strings using the
    Wagner-Fischer dynamic programming algorithm.

    Algorithm X in the original article.

    The distance between identical strings is 0, and for each basic
    difference between them (any change that needs to be applied to make
    them identical) the distance grows by 1.

    Basic operations include: insertion, deletion, substitution.

    The optional parameter 'cost' is the cost of the substitution operation
    (1 by default). (Parameter name: 'cost'.)
    """
    removal_cost = 1
    insertion_cost = 1
    substitution_cost = 1 if 'cost' not in parameters else parameters['cost']

    length_a = len(string_a) + 1
    length_b = len(string_b) + 1

    d = __make_arr_2d(None, height=length_a, width=length_b)

    d[0][0] = 0

    # Fill in the defaults for string A.
    for i in range(1, length_a):
        d[i][0] = d[i - 1][0] + removal_cost # of removing string_a[i - 1]

    # Fill in the defaults for string B.
    for j in range(1, length_b):
        d[0][j] = d[0][j - 1] + insertion_cost # of inserting string_b[j - 1]

    # The trace cost function (gamma in the original article): it is
    # simplified to check only whether the two characters are the same or
    # not. No empty strings are expected though.
    gamma = lambda a, b: 0 if a == b else substitution_cost

    # Fill in the rest of the array.
    for i in range(1, length_a):
        for j in range(1, length_b):
            m1 = d[i - 1][j - 1] + gamma(string_a[i - 1], string_b[j - 1])
            m2 = d[i - 1][j] + removal_cost # of removing string_a[i - 1]
            m3 = d[i][j - 1] + insertion_cost # of inserting string_b[j - 1]

            d[i][j] = min(m1, m2, m3)

    return d[length_a - 1][length_b - 1]

# All the functions and their names for convenience.
API = {
    'Levenshtein': levenshtein_distance,
    'Damerau-Levenshtein': damerau_levenshtein_distance,
    'Hamming': hamming_distance,
    'Jaro': jaro_distance,
    'Jaro-Winkler': jaro_winkler_distance,
    'Wagner-Fischer': wagner_fischer_distance,
}

# A demonstration.
if __name__ == '__main__':
    from sys import argv, stdout

    # The words must come in pairs, so the argument count (including the
    # program name) has to be odd and at least 3.
    if len(argv) < 3 or len(argv) % 2 == 0:
        stdout.write("Usage: %s WORD_A WORD_B [ WORD_A WORD_B [...] ]\n" \
                                                                    % argv[0])
        exit(-1)

    for i in range(1, int(len(argv) / 2) + 1):

        word_a = argv[2 * i - 1]
        word_b = argv[2 * i]
        sentence_a, sentence_b = [s.split() for s in [word_a, word_b]]

        title = 'distance between "%s" and "%s"' % (word_a, word_b)
        maxlen = max([len(i) for i in API.keys()] + [len(title)])

        row = '| %' + str(maxlen) + 's | %5.1f | %5.1f |\n'
        strrow = row.replace('.1f', 's')
        rule = strrow.replace('|', '+').replace(' ', '-') % \
                                            (maxlen * '-', 5 * '-', 5 * '-')

        stdout.write(rule)
        stdout.write(strrow % (title, 'words', 'lists'))
        stdout.write(rule)
        for method in API:
            word_r = float(API[method](word_a, word_b))
            sentence_r = float(API[method](sentence_a, sentence_b))
            stdout.write(row % (method, word_r, sentence_r))
        stdout.write(rule)
        stdout.write("\n")


The code is available on GitHub at python/editdist.py.
