Monday, September 27, 2010

Duplicates

I suspect I might have a number of large files on all my disks that are exact copies of one another, which I probably made during some re-installation as a backup or something and subsequently forgot about. However, I am much too lazy to go through all my disks and actually check whether this is the case, and even less inclined to clean it all up...

So I figured I'd make this script which does it for me, right? It took me a couple of days, but writing scripts is much more fun than cleaning up... Also I had to sit on this train for 3 hours, twice, and then some more at the station, and then in a hotel room. So it was only natural to write something.

It works like this: it looks through all the files in a given directory (recursively) and then, for each pair of files, it checks whether their contents differ. Actually, first each file's contents are read in and an MD5 hash is computed; then the hashes are compared, and only if those are equal do I compare the contents bit-by-bit. If they are equal too, then I have found a duplicate and can delete one!
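Boiled down, the check for a single pair of files looks more or less like this (just a minimal sketch of the idea, not the actual script, which follows below):

from hashlib import md5

def read_data(path):
    # Read the whole file into memory.
    with open(path, 'rb') as stream:
        return stream.read()

def is_duplicate(path_a, path_b):
    data_a, data_b = read_data(path_a), read_data(path_b)
    # Cheap check first: compare MD5 hashes of the contents.
    if md5(data_a).digest() != md5(data_b).digest():
        return False
    # The hashes match, so make sure by comparing the contents directly.
    return data_a == data_b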

After implementing the script in a rough-and-ready fashion I figured out that some logging would be useful, since it was annoying to wait a long time for the script to terminate only to find some weird garbled results and not know why, so I implemented that. Later on, after I ran it a couple more times, I decided that some mechanism for excluding files would also be nice, so I added options to do just that, either by specifying exact paths or by writing regular expressions. And then I thought that if I really wanted to build a complicated filter, it'd be a good idea to be able to create the path list externally and pipe it to the script... so I did that too. And the script kind of grew a lot. And I had to add all those comments and stuff too.
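Just to illustrate those options (the paths here are made up), excluding things by path or by regular expression, and piping in an externally prepared path list, end up looking something like this:

./duplicates.py -e /home/steve/backup -r '.*\.tmp$' /home/steve
find /home/steve -type f | ./duplicates.py -s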

Enough banter, the code:
 
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Duplicates
#
# A quick and simple script to find files in a directory which have the same
# contents as one another. A hash of each file's contents is created and
# compared against the others to find identical files. When hashes match, the
# files' contents are compared bit-by-bit. The script then prints out groups of
# files which have the same contents.
#
# Options:
#   -h, --help            show this help message and exit
#   --paragraphs          Print out final results as paragraphs, where each
#                         line is a filename, and each group of identical files
#                         is separated from another by an empty line.
#   -f FIELD, --field-separator=FIELD
#                         Print out identical files separated from one another
#                         by the specified string. Uses the system path
#                         separator by default.
#   -g GROUP, --group-separator=GROUP
#                         Print out groups of identical files separated from
#                         one another by the specified string. Uses new lines
#                         by default.
#   -v, --verbose         Show more diagnostic messages (none - only errors and
#                         final results, once [-v] - duplicate messages, twice
#                         [-vv] - matching hash messages, four times [-vvvv] -
#                         all possible diagnostic messages).
#   --hash-only           Do not compare duplicate files bit-by-bit if hashes
#                         match.
#   --non-recursive       Only look through the files in the directory but do
#                         not descend into subdirectories.
#   -e EXCLUDES, --exclude=EXCLUDES
#                         Do not search through the files described by this
#                         path.
#   -r REGEXPS, --exclude-regexp=REGEXPS
#                         Do not search through the files whose paths fit this
#                         regular expression. (Details on regular expressions:
#                         http://docs.python.org/library/re.html)
#   -s, --stdin           Read list of paths from standard input (arguments are
#                         ignored)
#
# Example:
#   This is how you go about checking if Steve has any duplicated files in his
#   home directory:
#       ./duplicates.py /home/steve
#
# License:
#   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
#
#   This program is free software: you can redistribute it and/or modify it
#   under the terms of the GNU General Public License version 3, as published
#   by the Free Software Foundation.
#
#   This program is distributed in the hope that it will be useful, but
#   WITHOUT ANY WARRANTY; without even the implied warranties of
#   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
#   PURPOSE. See the GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License along
#   with this program. If not, see <http://www.gnu.org/licenses/>.
#

import os, sys, re

# Levels of verbosity:
#  * results   - print out the final results formatted as specified by the user,
#  * errors    - show final results and messages from any errors that occur,
#  * duplicate - print out a message every time a duplicate is found,
#  * hash      - print out a message every time two hashes match,
#  * all       - show all possible diagnostic messages (a lot of text, this).
SHOW_RESULTS, SHOW_ERRORS, SHOW_DUPLICATE, SHOW_HASH, SHOW_ALL = range(-1, 4)

# The selected level of verbosity will be stored here.
global verbosity

def printerr(level, *args):
    """ Print an error message if the specified level of verbosity allows it."""
    if level > verbosity:
        return
    from sys import argv, stderr
    from os.path import basename
    stderr.write("%s:" % basename(argv[0]))
    for arg in args:
        stderr.write(" %s" % arg)
    stderr.write("\n")

def listall(root, recursive=True, excludes=[]):
    """ Traverse a file tree and list all files therein."""
    from os import listdir
    from os.path import isdir, abspath, exists, join
    dir_filter = lambda f: not isdir(f)
    files = []
    todo = [abspath(root)]
    while todo:
        path = todo.pop()
        # Check if the file is in the exclusion list, and if so, do not
        # process it further.
        if matches(excludes, path):
            printerr(SHOW_ALL, 'Path excluded from comparisons', "'%s'" % path)
            continue
        # In case any errors occur just print the message but do not stop
        # working: results will be less exact, but at least there will be some.
        try:
            printerr(SHOW_ALL, 'Found file:', "'%s'" % path)
            # Ordinary files go onto the file list and will be checked for
            # duplicates.
            if not isdir(path):
                files.append(path)
                continue
            # Directories are listed and their contents are put back onto the
            # todo list, while they themselves will not be checked for
            # duplicates.
            contents = [join(path, file) for file in listdir(path)]
            todo += contents if recursive else filter(dir_filter, contents)
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return files

def same_file(data_a, data_b):
    """ Compare the contents of two files bit by bit."""
    len_a = len(data_a)
    len_b = len(data_b)
    if len_a != len_b:
        return False
    for i in range(0, len_a):
        if data_a[i] != data_b[i]:
            return False
    return True

def matches(excludes, path):
    """ Check if the given path is in the exclusion list, which consists of
    strings and compiled regular expressions."""
    for expression in excludes:
        if type(expression) == str:
            if path == expression:
                return True
        else:
            if expression.match(path):
                return True
    return False

def read_data(path):
    """ Read the contents of a given file and close the stream."""
    data_source = open(path, 'rb')
    data = data_source.read()
    data_source.close()
    return data

def duplicates(paths, onlyhashes=False, excludes=[]):
    """ For each file in a list of files find its duplicates in that list. A
    duplicate of a file is one that has the same contents. Files are compared
    first by hashes of their contents and, if those match, bit by bit (although
    the latter can be turned off for a performance increase)."""
    from hashlib import md5
    hashes = {}
    duplicates = []
    for path in paths:
        printerr(SHOW_ALL, 'Looking for duplicates for', "'%s'" % path)
        try:
            data = read_data(path)
            hash = md5(data).digest()
            if hash in hashes:
                other_paths = hashes[hash]
                duplicated = False
                for other_path in other_paths:
                    # If only hashes are supposed to be taken into account,
                    # then assume this file is a duplicate and do not process
                    # further.
                    if onlyhashes:
                        duplicates.append((other_path, path))
                        duplicated = True
                        break
                    other_data = read_data(other_path)
                    # Check if files are different despite having the same hash.
                    if same_file(data, other_data):
                        printerr(SHOW_DUPLICATE, 'Found duplicates:', \
                            "'%s'" % path, 'and', "'%s'" % other_path)
                        duplicates.append((other_path, path))
                        duplicated = True
                if not duplicated:
                    # Same hash but different content.
                    printerr(SHOW_HASH, 'No duplicate found for', "'%s'" % path)
                hashes[hash].append(path)
            else:
                # No matching hash.
                printerr(SHOW_ALL, 'No duplicate found for', "'%s'" % path)
                hashes[hash] = [path]
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return duplicates

def sort(duplicates):
    """ Organize pairs of duplicates into groups (sets)."""
    sorts = []
    for duplicate_a, duplicate_b in duplicates:
        for sort in sorts:
            if duplicate_a in sort or duplicate_b in sort:
                sort.add(duplicate_a)
                sort.add(duplicate_b)
                break
        else:
            sorts.append(set([duplicate_a, duplicate_b]))
    return sorts

def print_results(sorts, separator=os.pathsep, group_separator="\n"):
    """ Print out sets of results, where each element of a set is one field,
    separated from others by a field separator, and each set is a record or
    group, separated from other groups by a group separator."""
    from sys import stdout
    for sort in sorts:
        first = True
        for s in sort:
            if not first:
                stdout.write(separator)
            stdout.write(s)
            first = False
        stdout.write(group_separator)

if __name__ == '__main__':
    """ The main function: argument handling and all processing start here."""

    from optparse import OptionParser
    from os.path import basename
    from sys import argv

    # Prepare user options.
    usage = '\n%s [OPTIONS] PATH_LIST ' % basename(argv[0])

    description = 'Looks through the specified directory or directories ' + \
        'for duplicated files. Files are compared primarily by a hash ' + \
        'created from their contents, and if there\'s a hit, they are ' + \
        'compared bit-by-bit to ensure correctness.'

    parser = OptionParser(usage=usage, description=description)

    parser.add_option('--paragraphs', action='store_true', dest='paragraphs', \
        help='Print out final results as paragraphs, where each line is a ' + \
        'filename, and each group of identical files is separated from ' + \
        'another by an empty line.', default=False)
    parser.add_option('-f', '--field-separator', action='store', dest='field', \
        help='Print out identical files separated from one another by the ' + \
        'specified string. Uses the system path separator by default.', \
        default=os.pathsep)
    parser.add_option('-g', '--group-separator', action='store', dest='group', \
        help='Print out groups of identical files separated from one ' + \
        'another by the specified string. Uses new lines by default.', \
        default='\n')
    parser.add_option('-v', '--verbose', action='count', dest='verbosity', \
        help='Show more diagnostic messages (none - only errors and final ' + \
        'results, once [-v] - duplicate messages, twice [-vv] - matching ' + \
        'hash messages, four times [-vvvv] - all possible diagnostic ' + \
        'messages).', default=SHOW_ERRORS)
    parser.add_option('--hash-only', action='store_true', dest='hashonly', \
        help='Do not compare duplicate files bit-by-bit if hashes match', \
        default=False)
    parser.add_option('--non-recursive', action='store_false', \
        help='Only look through the files in the directory but do not ' + \
        'descend into subdirectories.', default=True, dest='recursive')
    parser.add_option('-e', '--exclude', action='append', dest='excludes', \
        help='Do not search through the files described by this path.', \
        default=[])
    parser.add_option('-r', '--exclude-regexp', action='append', \
        dest='regexps', help='Do not search through the files whose paths ' + \
        'fit this regular expression. (Details on regular expressions: ' + \
        'http://docs.python.org/library/re.html)', default=[])
    parser.add_option('-s', '--stdin', action='store_true', dest='stdin', \
        help='Read list of paths from standard input (arguments are ignored)', \
        default=False)

    # Gather option information.
    opts, args = parser.parse_args()
    if opts.paragraphs:
        opts.field = '\n'
        opts.group = '\n\n'
    verbosity = opts.verbosity

    # Compile the exclusion regular expressions.
    for regexp in opts.regexps:
        matcher = re.compile(regexp)
        opts.excludes.append(matcher)

    files = []
    if opts.stdin:
        # The user provides paths on standard input; the script ignores arguments.
        from sys import stdin
        from os.path import exists, abspath
        printerr(SHOW_ALL, 'Reading file paths from standard input')
        for line in stdin.readlines():
            line = line[:-1] # get rid of the trailing new line
            if exists(line):
                files.append(abspath(line))
                continue
            elif line == '':
                continue
            printerr(SHOW_ERRORS, 'File not found', "'%s'," % line, 'skipping')
    else:
        # Get file paths by walking all the arguments' file subtrees.
        if not args:
            parser.print_help()
            sys.exit(1)
        for arg in args:
            printerr(SHOW_ALL, 'Reading file tree under %s %s' \
                % (arg, 'recursively' if opts.recursive else ''))
            files += listall(arg, opts.recursive, opts.excludes)

    # Processing.
    sorts = sort(duplicates(files, opts.hashonly))
    print_results(sorts, separator=opts.field, group_separator=opts.group)
The code is also available on GitHub at python/duplicates.py.
