Monday, September 27, 2010

Duplicates

I suspect I might have a number of large files on all my disks that are exact copies of one another, which I probably made during some re-installation as a backup or something and subsequently forgot about. However, I am much too lazy to go through all my disks and actually check whether this is the case, and even less inclined to clean it all up...

So I figured I'd make this script which does it for me, right? It took me a couple of days, but writing scripts is much more fun than cleaning up... Also I had to sit on this train for 3 hours, twice, and then some more at the station, and then in a hotel room. So it was only natural to write something.

It works like this: it looks through all the files in a given directory (recursively) and then, for each pair of files, it checks whether their contents differ. Actually, first each file's contents are read in and an MD5 hash is computed; then the hashes are compared, and only if those are equal do I compare the contents bit-by-bit. If they are equal too, then I have found a duplicate and can delete one!
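Boiled down, the check for a single pair of files looks more or less like this (just a minimal sketch of the idea, not the actual script, which follows below):

from hashlib import md5

def read_data(path):
    # Read the whole file into memory.
    with open(path, 'rb') as stream:
        return stream.read()

def is_duplicate(path_a, path_b):
    data_a, data_b = read_data(path_a), read_data(path_b)
    # Cheap check first: compare MD5 hashes of the contents.
    if md5(data_a).digest() != md5(data_b).digest():
        return False
    # The hashes match, so make sure by comparing the contents directly.
    return data_a == data_b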

After implementing the script in a rough-and-ready fashion I figured out that some logging would be useful, since it was annoying to wait a long time for the script to terminate only to find some weird garbled results and not know why, so I implemented that. Later on, after I ran it a couple more times, I decided that some mechanism for excluding files would also be nice, so I added options to do just that, either by specifying exact paths or by writing regular expressions. And then I thought that if I really wanted to build a complicated filter, it'd be a good idea to be able to create the path list externally and pipe it to the script... so I did that too. And the script kind of grew a lot. And I had to add all those comments and stuff too.
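Just to illustrate those options (the paths here are made up), excluding things by path or by regular expression, and piping in an externally prepared path list, end up looking something like this:

./duplicates.py -e /home/steve/backup -r '.*\.tmp$' /home/steve
find /home/steve -type f | ./duplicates.py -s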

Enough banter, the code:
 
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Duplicates
#
# A quick and simple script to find files in a directory which have the same
# contents as one another. A hash of each file's contents is created and
# compared against the others to find identical files. When hashes match, the
# files' contents are compared bit-by-bit. The script then prints out groups of
# files which have the same contents.
#
# Options:
#   -h, --help            show this help message and exit
#   --paragraphs          Print out final results as paragraphs, where each
#                         line is a filename, and each group of identical files
#                         is separated from another by an empty line.
#   -f FIELD, --field-separator=FIELD
#                         Print out identical files separated from one another
#                         by the specified string. Uses the system path
#                         separator by default.
#   -g GROUP, --group-separator=GROUP
#                         Print out groups of identical files separated from
#                         one another by the specified string. Uses new lines
#                         by default.
#   -v, --verbose         Show more diagnostic messages (none - only errors and
#                         final results, once [-v] - duplicate messages, twice
#                         [-vv] - matching hash messages, four times [-vvvv] -
#                         all possible diagnostic messages).
#   --hash-only           Do not compare duplicate files bit-by-bit if hashes
#                         match.
#   --non-recursive       Only look through the files in the directory but do
#                         not descend into subdirectories.
#   -e EXCLUDES, --exclude=EXCLUDES
#                         Do not search through the files described by this
#                         path.
#   -r REGEXPS, --exclude-regexp=REGEXPS
#                         Do not search through the files whose paths fit this
#                         regular expression. (Details on regular expressions:
#                         http://docs.python.org/library/re.html)
#   -s, --stdin           Read list of paths from standard input (arguments are
#                         ignored)
#
# Example:
#   This is how you go about checking if Steve has any duplicated files in his
#   home directory:
#       ./duplicates.py /home/steve
#
# License:
#   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
#
#   This program is free software: you can redistribute it and/or modify it
#   under the terms of the GNU General Public License version 3, as published
#   by the Free Software Foundation.
#
#   This program is distributed in the hope that it will be useful, but
#   WITHOUT ANY WARRANTY; without even the implied warranties of
#   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
#   PURPOSE. See the GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License along
#   with this program. If not, see <http://www.gnu.org/licenses/>.
#

import os, sys, re

# Levels of verbosity:
#  * results   - print out the final results formatted as specified by the user,
#  * errors    - show final results and messages from any errors that occur,
#  * duplicate - print out a message every time a duplicate is found,
#  * hash      - print out a message every time two hashes match,
#  * all       - show all possible diagnostic messages (a lot of text, this).
SHOW_RESULTS, SHOW_ERRORS, SHOW_DUPLICATE, SHOW_HASH, SHOW_ALL = range(-1, 4)

# The selected level of verbosity will be stored here.
global verbosity

def printerr(level, *args):
    """ Print an error message if the specified level of verbosity allows it."""
    if level > verbosity:
        return
    from sys import argv, stderr
    from os.path import basename
    stderr.write("%s:" % basename(argv[0]))
    for arg in args:
        stderr.write(" %s" % arg)
    stderr.write("\n")

def listall(root, recursive=True, excludes=[]):
    """ Traverse a file tree and list all files therein."""
    from os import listdir
    from os.path import isdir, abspath, exists, join
    dir_filter = lambda f: not isdir(f)
    files = []
    todo = [abspath(root)]
    while todo:
        path = todo.pop()
        # Check if the file is in the exclusion list, and if so, do not
        # process it further.
        if matches(excludes, path):
            printerr(SHOW_ALL, 'Path excluded from comparisons', "'%s'" % path)
            continue
        # In case any errors occur just print the message but do not stop
        # working: results will be less exact, but at least there will be some.
        try:
            printerr(SHOW_ALL, 'Found file:', "'%s'" % path)
            # Ordinary files go onto the file list and will be checked for
            # duplicates.
            if not isdir(path):
                files.append(path)
                continue
            # Directories are listed and their contents are put back onto the
            # todo list, while they themselves will not be checked for
            # duplicates.
            contents = [join(path, file) for file in listdir(path)]
            todo += contents if recursive else filter(dir_filter, contents)
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return files

def same_file(data_a, data_b):
    """ Compare the contents of two files bit by bit."""
    len_a = len(data_a)
    len_b = len(data_b)
    if len_a != len_b:
        return False
    for i in range(0, len_a):
        if data_a[i] != data_b[i]:
            return False
    return True

def matches(excludes, path):
    """ Check if the given path is in the exclusion list, which consists of
    strings and compiled regular expressions."""
    for expression in excludes:
        if type(expression) == str:
            if path == expression:
                return True
        else:
            if expression.match(path):
                return True
    return False

def read_data(path):
    """ Read the contents of a given file and close the stream."""
    data_source = open(path, 'rb')
    data = data_source.read()
    data_source.close()
    return data

def duplicates(paths, onlyhashes=False, excludes=[]):
    """ For each file in a list of files find its duplicates in that list. A
    duplicate of a file is one that has the same contents. Files are compared
    first by hashes of their contents and, if those match, bit by bit (although
    the latter can be turned off for a performance increase)."""
    from hashlib import md5
    hashes = {}
    duplicates = []
    for path in paths:
        printerr(SHOW_ALL, 'Looking for duplicates for', "'%s'" % path)
        try:
            data = read_data(path)
            hash = md5(data).digest()
            if hash in hashes:
                other_paths = hashes[hash]
                duplicated = False
                for other_path in other_paths:
                    # If only hashes are supposed to be taken into account,
                    # then assume this file is a duplicate and do not process
                    # further.
                    if onlyhashes:
                        duplicates.append((other_path, path))
                        duplicated = True
                        break
                    other_data = read_data(other_path)
                    # Check if files are different despite having the same hash.
                    if same_file(data, other_data):
                        printerr(SHOW_DUPLICATE, 'Found duplicates:', \
                            "'%s'" % path, 'and', "'%s'" % other_path)
                        duplicates.append((other_path, path))
                        duplicated = True
                if not duplicated:
                    # Same hash but different content.
                    printerr(SHOW_HASH, 'No duplicate found for', "'%s'" % path)
                hashes[hash].append(path)
            else:
                # No matching hash.
                printerr(SHOW_ALL, 'No duplicate found for', "'%s'" % path)
                hashes[hash] = [path]
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return duplicates

def sort(duplicates):
    """ Organize pairs of duplicates into groups (sets)."""
    sorts = []
    for duplicate_a, duplicate_b in duplicates:
        for sort in sorts:
            if duplicate_a in sort or duplicate_b in sort:
                sort.add(duplicate_a)
                sort.add(duplicate_b)
                break
        else:
            sorts.append(set([duplicate_a, duplicate_b]))
    return sorts

def print_results(sorts, separator=os.pathsep, group_separator="\n"):
    """ Print out sets of results, where each element of a set is one field,
    separated from others by a field separator, and each set is a record or
    group, separated from other groups by a group separator."""
    from sys import stdout
    for sort in sorts:
        first = True
        for s in sort:
            if not first:
                stdout.write(separator)
            stdout.write(s)
            first = False
        stdout.write(group_separator)

if __name__ == '__main__':
    """ The main function: argument handling and all processing start here."""

    from optparse import OptionParser
    from os.path import basename
    from sys import argv

    # Prepare user options.
    usage = '\n%s [OPTIONS] PATH_LIST ' % basename(argv[0])

    description = 'Looks through the specified directory or directories ' + \
        'for duplicated files. Files are compared primarily by a hash ' + \
        'created from their contents, and if there\'s a hit, they are ' + \
        'compared bit-by-bit to ensure correctness.'

    parser = OptionParser(usage=usage, description=description)

    parser.add_option('--paragraphs', action='store_true', dest='paragraphs', \
        help='Print out final results as paragraphs, where each line is a ' + \
        'filename, and each group of identical files is separated from ' + \
        'another by an empty line.', default=False)
    parser.add_option('-f', '--field-separator', action='store', dest='field', \
        help='Print out identical files separated from one another by the ' + \
        'specified string. Uses the system path separator by default.', \
        default=os.pathsep)
    parser.add_option('-g', '--group-separator', action='store', dest='group', \
        help='Print out groups of identical files separated from one ' + \
        'another by the specified string. Uses new lines by default.', \
        default='\n')
    parser.add_option('-v', '--verbose', action='count', dest='verbosity', \
        help='Show more diagnostic messages (none - only errors and final ' + \
        'results, once [-v] - duplicate messages, twice [-vv] - matching ' + \
        'hash messages, four times [-vvvv] - all possible diagnostic ' + \
        'messages).', default=SHOW_ERRORS)
    parser.add_option('--hash-only', action='store_true', dest='hashonly', \
        help='Do not compare duplicate files bit-by-bit if hashes match', \
        default=False)
    parser.add_option('--non-recursive', action='store_false', \
        help='Only look through the files in the directory but do not ' + \
        'descend into subdirectories.', default=True, dest='recursive')
    parser.add_option('-e', '--exclude', action='append', dest='excludes', \
        help='Do not search through the files described by this path.', \
        default=[])
    parser.add_option('-r', '--exclude-regexp', action='append', \
        dest='regexps', help='Do not search through the files whose paths ' + \
        'fit this regular expression. (Details on regular expressions: ' + \
        'http://docs.python.org/library/re.html)', default=[])
    parser.add_option('-s', '--stdin', action='store_true', dest='stdin', \
        help='Read list of paths from standard input (arguments are ignored)', \
        default=False)

    # Gather option information.
    opts, args = parser.parse_args()
    if opts.paragraphs:
        opts.field = '\n'
        opts.group = '\n\n'
    verbosity = opts.verbosity

    # Compile the exclusion regular expressions.
    for regexp in opts.regexps:
        matcher = re.compile(regexp)
        opts.excludes.append(matcher)

    files = []
    if opts.stdin:
        # The user provides paths on standard input; the script ignores arguments.
        from sys import stdin
        from os.path import exists, abspath
        printerr(SHOW_ALL, 'Reading file paths from standard input')
        for line in stdin.readlines():
            line = line[:-1] # get rid of the trailing new line
            if exists(line):
                files.append(abspath(line))
                continue
            elif line == '':
                continue
            printerr(SHOW_ERRORS, 'File not found', "'%s'," % line, 'skipping')
    else:
        # Get file paths by walking all the arguments' file subtrees.
        if not args:
            parser.print_help()
            sys.exit(1)
        for arg in args:
            printerr(SHOW_ALL, 'Reading file tree under %s %s' \
                % (arg, 'recursively' if opts.recursive else ''))
            files += listall(arg, opts.recursive, opts.excludes)

    # Processing.
    sorts = sort(duplicates(files, opts.hashonly))
    print_results(sorts, separator=opts.field, group_separator=opts.group)
The code is also available on GitHub at python/duplicates.py.
