Monday, September 27, 2010

Duplicates

I suspect I might have a number of heavy files on all my disks that are the exact same file, which I had probably copied during some re-installation as backup or something and subsequently forgot. However, I am much too lazy to go through all my disks and actually check if this is the case, and even less inclined to clean it all up...

So I figured I'd make this script which does it for me, right? It took me a couple of days, but writing scripts is much more fun than cleaning up... Also I had to sit on this train for 3 hours, twice, and then some more at the station, and then in a hotel room. So it was only natural to write something.

It works like this: it looks through all the files in a given directory (recursively) and then for each pair of files it checks whether their contents differ. Actually, first the contents are read in and an MD5 hash is made, then the hashes are compared, and only if those are equal do I compare the files bit-by-bit. If they are equal, then I have found a duplicate and can delete one!
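The hash-first, compare-second idea can be sketched in a few lines. This is only a condensed illustration, not the script itself: the function name find_duplicates and the use of defaultdict are my own choices for the sketch.

```python
import hashlib
from collections import defaultdict


def find_duplicates(paths):
    """Return (earlier_path, duplicate_path) pairs among the given files."""
    by_hash = defaultdict(list)  # MD5 digest -> distinct files seen with it
    pairs = []
    for path in paths:
        with open(path, 'rb') as source:
            data = source.read()
        digest = hashlib.md5(data).digest()
        for other in by_hash[digest]:
            with open(other, 'rb') as source:
                # Equal hashes only suggest equal contents, so confirm it.
                if source.read() == data:
                    pairs.append((other, path))
                    break
        else:
            # No identical file seen so far: remember this one.
            by_hash[digest].append(path)
    return pairs
```

Like the script, the sketch reads whole files into memory, which is fine for a quick check but not ideal for very large files.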

After implementing the script in a sort of rough-and-ready fashion I figured out that some logging would be useful, since it was annoying to wait a long time for the script to terminate just to find some weird garbled results and not know why, so I implemented that. Later on, after I ran it a couple more times, I decided that some mechanism for excluding files would also be nice. So I added some options to do just that, either by specifying exact paths or by writing some regular expressions. And then I thought that if I really wanted to make a complicated filter then maybe it'd be a good idea to be able to create the path list externally and then pipe it to the script... so I did that too. And the script kind of grew a lot. And I had to add all those comments and stuff too.
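A rough sketch of the two filtering ideas, exact paths plus regular expressions, and reading a ready-made path list from a stream. The names is_excluded and read_paths are made up for this illustration; the script below does the same job with slightly different plumbing.

```python
import re


def is_excluded(path, exact_paths=(), patterns=()):
    """True if the path equals an excluded path or matches a pattern."""
    if path in exact_paths:
        return True
    # re.match anchors at the start of the string, so a pattern meant to
    # hit the middle of a path needs a leading '.*'.
    return any(re.match(pattern, path) for pattern in patterns)


def read_paths(stream):
    """Read one path per line from a stream, skipping empty lines."""
    return [line.rstrip('\n') for line in stream if line.strip()]
```

With something like read_paths(sys.stdin), the list can come from any tool that prints one path per line.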

Enough banter, the code:
 
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Duplicates
#
# A quick and simple script to find files in a directory which have the same
# contents as one another. A hash of each file's contents is created and
# compared against the others to find identical files. When hashes match, the
# files' contents are compared bit-by-bit. The script then prints out groups of
# files which have the same contents.
#
# Options:
#   -h, --help            show this help message and exit
#   --paragraphs          Print out final results as paragraphs, where each line
#                         is a filename, and each group of identical files is
#                         separated from another by an empty line.
#   -f FIELD, --field-separator=FIELD
#                         Print out identical files separated from one another
#                         by the specified string. Uses system path separator by
#                         default.
#   -g GROUP, --group-separator=GROUP
#                         Print out groups of identical files separated from one
#                         another by the specified string. Uses new lines by
#                         default.
#   -v, --verbose         Show more diagnostic messages (none - only errors and
#                         final results, once [-v] - duplicate messages, twice
#                         [-vv] - matching hash messages, four times [-vvvv] -
#                         all possible diagnostic messages.
#   --hash-only           Do not compare duplicate files bit-by-bit if hashes
#                         match
#   --non-recursive       Only look through the files in the directory but do
#                         not descend into subdirectories.
#   -e EXCLUDES, --exclude=EXCLUDES
#                         Do not search through the files described by this
#                         path.
#   -r REGEXPS, --exclude-regexp=REGEXPS
#                         Do not search through the files whose paths fit this
#                         regular expression. (Details on regular expressions:
#                         http://docs.python.org/library/re.html)
#   -s, --stdin           Read list of paths from standard input (arguments are
#                         ignored)
#
# Example:
#   This is how you go about checking if Steve has any duplicated files in his
#   home directory:
#       ./duplicates.py /home/steve
#
# License:
#   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
#
#   This program is free software: you can redistribute it and/or modify it
#   under the terms of the GNU General Public License version 3, as published
#   by the Free Software Foundation.
#
#   This program is distributed in the hope that it will be useful, but
#   WITHOUT ANY WARRANTY; without even the implied warranties of
#   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
#   PURPOSE. See the GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License along
#   with this program. If not, see <http://www.gnu.org/licenses/>.
#

import os, sys, re

# Levels of verbosity:
# * results - print out the final results formatted as specified by the user,
# * errors - show final results and messages from any errors that occur,
# * duplicate - print out a message every time a duplicate is found,
# * hash - print out a message every time two hashes match,
# * all - show all diagnostic messages possible (a lot of text, this).
SHOW_RESULTS, SHOW_ERRORS, SHOW_DUPLICATE, SHOW_HASH, SHOW_ALL = range(-1, 4)

# The selected level of verbosity will be stored here.
verbosity = SHOW_ERRORS

def printerr(level, *args):
    """ Print an error message if the specified level of verbosity allows it."""
    # The option parser leaves verbosity as None when no -v is given, so
    # treat that the same as the lowest level (errors only).
    if level > (verbosity or SHOW_ERRORS):
        return
    from sys import argv, stderr
    from os.path import basename
    stderr.write("%s:" % basename(argv[0]))
    for arg in args:
        stderr.write(" %s" % arg)
    stderr.write("\n")

def listall(root, recursive=True, excludes=[]):
    """ Traverse a file tree and list all files therein."""
    from os import listdir
    from os.path import isdir, abspath, join
    dir_filter = lambda f: not isdir(f)
    files = []
    todo = [abspath(root)]
    while todo:
        path = todo.pop()
        # Check if the file is in the exclusion list, and if so, do not
        # process it further.
        if matches(excludes, path):
            printerr(SHOW_ALL, 'Path excluded from comparisons', "'%s'" % path)
            continue
        # In case any errors occur just print the message but do not stop
        # working: results will be less exact, but at least there will be some.
        try:
            printerr(SHOW_ALL, 'Found file:', "'%s'" % path)
            # Ordinary files go onto the file list and will be checked for
            # duplicates.
            if not isdir(path):
                files.append(path)
                continue
            # Directories are listed and their contents are put back onto the
            # todo list, while they themselves will not be checked for
            # duplicates.
            contents = [join(path, file) for file in listdir(path)]
            todo += contents if recursive else filter(dir_filter, contents)
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return files

def same_file(data_a, data_b):
    """ Compare the contents of two files bit by bit."""
    # Comparing the two strings of bytes checks both length and contents.
    return data_a == data_b

def matches(excludes, path):
    """ Check if the given path is in the exclusion list, which consists of
    strings and compiled regular expressions."""
    for expression in excludes:
        if isinstance(expression, str):
            if path == expression:
                return True
        else:
            if expression.match(path):
                return True
    return False

def read_data(path):
    """ Read contents of a given file and close the stream."""
    with open(path, 'rb') as data_source:
        return data_source.read()

def duplicates(paths, onlyhashes=False, excludes=[]):
    """ For each file in a list of files find its duplicates in that list. A
    duplicate of a file is one that has the same contents. The files are
    compared first by hashes of their contents and if those match, bit by bit
    (although the latter can be turned off for a performance increase)."""
    from hashlib import md5
    hashes = {}
    duplicates = []
    for path in paths:
        printerr(SHOW_ALL, 'Looking for duplicates for', "'%s'" % path)
        try:
            data = read_data(path)
            hash = md5(data).digest()
            if hash in hashes:
                other_paths = hashes[hash]
                duplicated = False
                for other_path in other_paths:
                    # If only hashes are supposed to be taken into account,
                    # then assume this file is a duplicate and do not process
                    # further.
                    if onlyhashes:
                        duplicates.append((other_path, path))
                        duplicated = True
                        break
                    other_data = read_data(other_path)
                    # Check if files are different despite having the same hash.
                    if same_file(data, other_data):
                        printerr(SHOW_DUPLICATE, 'Found duplicates:',
                                 "'%s'" % path, 'and', "'%s'" % other_path)
                        duplicates.append((other_path, path))
                        duplicated = True
                        # Files kept under one hash all differ from each
                        # other, so there can be no second match.
                        break
                if not duplicated:
                    # Same hash but different content.
                    printerr(SHOW_HASH, 'No duplicate found for', "'%s'" % path)
                    hashes[hash].append(path)
            else:
                # No matching hash.
                printerr(SHOW_ALL, 'No duplicate found for', "'%s'" % path)
                hashes[hash] = [path]
        except Exception as exception:
            printerr(SHOW_ERRORS, exception)
    return duplicates

def sort(duplicates):
    """ Organize pairs of duplicates into groups (sets)."""
    sorts = []
    for duplicate_a, duplicate_b in duplicates:
        for group in sorts:
            if duplicate_a in group or duplicate_b in group:
                group.add(duplicate_a)
                group.add(duplicate_b)
                break
        else:
            sorts.append(set([duplicate_a, duplicate_b]))
    return sorts

def print_results(sorts, separator=os.pathsep, group_separator="\n"):
    """ Print out sets of results, where each element of a set is one field,
    separated from others by a field separator, and each set is a record or
    group, separated from other groups by a group separator."""

    from sys import stdout
    for group in sorts:
        stdout.write(separator.join(group))
        stdout.write(group_separator)

if __name__ == '__main__':
    # The main function: argument handling and all processing start here.

    from optparse import OptionParser
    from os.path import basename
    from sys import argv

    # Prepare user options.
    usage = '\n%s [OPTIONS] PATH_LIST ' % basename(argv[0])

    description = ('Looks through the specified directory or directories '
        'for duplicated files. Files are compared primarily by a hash '
        'created from their contents, and if there\'s a hit, they are '
        'compared bit-by-bit to ensure correctness.')

    parser = OptionParser(usage=usage, description=description)

    parser.add_option('--paragraphs', action='store_true', dest='paragraphs',
        help='Print out final results as paragraphs, where each line is '
        'a filename, and each group of identical files is separated from '
        'another by an empty line.', default=False)
    parser.add_option('-f', '--field-separator', action='store', dest='field',
        help='Print out identical files separated from one another by the '
        'specified string. Uses system path separator by default.',
        default=os.pathsep)
    parser.add_option('-g', '--group-separator', action='store', dest='group',
        help='Print out groups of identical files separated from one '
        'another by the specified string. Uses new lines by default.',
        default='\n')
    # Without a default the count stays None when -v is not given.
    parser.add_option('-v', '--verbose', action='count', dest='verbosity',
        help='Show more diagnostic messages (none - only errors and final '
        'results, once [-v] - duplicate messages, twice [-vv] - matching '
        'hash messages, four times [-vvvv] - all possible diagnostic messages.',
        default=0)
    parser.add_option('--hash-only', action='store_true', dest='hashonly',
        help='Do not compare duplicate files bit-by-bit if hashes match',
        default=False)
    parser.add_option('--non-recursive', action='store_false',
        help='Only look through the files in the directory but do not '
        'descend into subdirectories.', default=True, dest='recursive')
    parser.add_option('-e', '--exclude', action='append', dest='excludes',
        help='Do not search through the files described by this path.',
        default=[])
    parser.add_option('-r', '--exclude-regexp', action='append',
        dest='regexps', help='Do not search through the files whose paths '
        'fit this regular expression. (Details on regular expressions: '
        'http://docs.python.org/library/re.html)', default=[])
    parser.add_option('-s', '--stdin', action='store_true', dest='stdin',
        help='Read list of paths from standard input (arguments are ignored)',
        default=False)

    # Gathering option information.
    opts, args = parser.parse_args()
    if opts.paragraphs:
        opts.field = '\n'
        opts.group = '\n\n'
    verbosity = opts.verbosity

    # Compiling excluding regular expressions.
    for regexp in opts.regexps:
        matcher = re.compile(regexp)
        opts.excludes.append(matcher)

    files = []
    if opts.stdin:
        # User provides paths by standard input, script ignores arguments.
        from sys import stdin
        from os.path import exists, abspath
        printerr(SHOW_ALL, 'Reading file paths from standard input')
        for line in stdin.readlines():
            # Get rid of the trailing new line (if there is one).
            line = line.rstrip('\n')
            if line == '':
                continue
            if exists(line):
                files.append(abspath(line))
                continue
            printerr(SHOW_ERRORS, 'File not found', "'%s'," % line, 'skipping')
    else:
        # Get file paths by parsing all arguments' file subtrees.
        if not args:
            parser.print_help()
            sys.exit(1)
        for arg in args:
            printerr(SHOW_ALL, 'Reading file tree under %s%s'
                % (arg, ' recursively' if opts.recursive else ''))
            files += listall(arg, opts.recursive, opts.excludes)

    # Processing.
    sorts = sort(duplicates(files, opts.hashonly))
    print_results(sorts, separator=opts.field, group_separator=opts.group)

The code is also available on GitHub at python/duplicates.py.

Thursday, September 16, 2010

Big Red Button

(I seem to have been carried away a little, but you can always skip the intro.)

If you've ever tried fiddling with some electronics on an amateurish level you'd probably thought at some point that it would be awesome if you could plug something into your computer and make it obey your evil commands. And then you'd laugh your evil mad laugh...

Or is it just me?

Anyway, it's a bugger to do that sort of thing with USB, especially if you're only learning which end of the soldering iron is the bit you hold and which is the bit you hold when you want a day off and some attention from medical personnel. Right, anyway... USB is a pain because you can't just plug a simple diode into the USB socket to turn it on or off: it's a serial port, so it has a complicated protocol to follow before it spits out any actual information. So you'd probably need to interface with the socket via an FT232R integrated circuit or something, but that's fiddly to hook up even at the best of times, and not as cheap as you'd like it to be.

But if you happen to find a computer which has a parallel port (you know, that pink one which you used to use for the printer) then you're all set, because that's not a serial port. No. It is a parallel port, as the name implies. In fact you can plug some 8 diodes and things in just like that and they will receive a nice 5V signal (Hey! Good for TTLs!) when you program the computer to do so. And that's what I'd done and other people have done as well.

Ok now... so how do I control the damn port? Well, you can put together a C program that'll do that for you, or, these days, a Python program, which is fine and dandy for us weird no-life developer nerds, but is definitely not going to work when you make a device for your dad. Back in the day, on those ugly, gray Windows 95 machines, we had someone's little window application, written in Delphi (that awful thing), but that is no good for a modern, up-to-date Linux user (also, I can't find the link anymore). And I couldn't really find anything worthwhile for the one true OS...

So something had to be done. Luckily I had some plane trips when I would be motivated (by way of having no alternative occupation except staring blankly at the seat in front of me) to hack up a little GTK application to serve my own needs, scratch an itch, that sort of thing...

(End of lengthy intro.)

Long story short, I wrote it in Python, using Quickly. I wanted to give it a simple interface, so I figured an enormous big red button would be the best way to go. True, you can't readily turn on only one bit of the port, but you can't have everything, and this way it won't confuse the people who might not be in the habit of sitting up late at night in the light of a computer screen and wondering if there's a 24/7 pizza place anywhere within walking distance.

Anyway, take a look at this here screenshot and tell me it's not at least amusing...


Feature number two was the ability to turn it on with a delay. So, say you have a device that you turn on, but you want it to turn on in an hour? Right-clickety on the button, set up the timer and you're good to go.
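The delayed switch boils down to scheduling an action for later. Here is a minimal sketch using threading.Timer from the standard library; the helper name switch_after is made up, and the application itself may well use a GTK timeout instead, so treat this as an illustration of the idea only.

```python
import threading


def switch_after(delay_seconds, action):
    """Run action() once after the given delay; returns the timer."""
    timer = threading.Timer(delay_seconds, action)
    timer.daemon = True   # do not keep the program alive just for the timer
    timer.start()
    return timer          # call timer.cancel() to abort before it fires
```

For the one-hour case from above that would be switch_after(3600, turn_on), where turn_on is whatever function flips the port.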

And finally, you can set it up so that it turns particular pins on or off when you press the button, so you can for instance keep one device turned on in the on position and turn on two other ones, but turn the first one off in the off position.
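Since the port has 8 data lines, a pin configuration is really just one byte per button position. A small sketch of that packing follows; the function name, the 0 to 7 pin numbering and the example pin choices are assumptions for illustration, and actually writing the byte to the port is left out because it depends on the library used.

```python
def pins_to_byte(pins):
    """Pack data-pin numbers (0..7) into the byte written to the port."""
    value = 0
    for pin in pins:
        if not 0 <= pin <= 7:
            raise ValueError('the port only has data pins 0..7: %r' % pin)
        value |= 1 << pin   # set the bit that drives this pin high
    return value


# Hypothetical setup: pin 0 is high while the button is on,
# pins 1 and 2 are high while it is off.
ON_BYTE = pins_to_byte([0])       # 0b00000001
OFF_BYTE = pins_to_byte([1, 2])   # 0b00000110
```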

It can also run with or without GUI, which you can find out about by running it as a script as:
bigredbutton -h

I'm not posting the code here, because it's on the long side, but it is available at GitHub as bigredbutton/ and, as previously stated, at https://launchpad.net/~konrad-siek/+archive/ppa.

Oh, and I'm sure it's all buggy as all hell, 'cuz over time I keep forgetting which bits I tested and fixed and which worked already. But it's on Launchpad, so I'm sure I can fix them as they appear. Hell, maybe I'll even add a feature or two in the fullness of time.

How to install (on Ubuntu): You can...
  1. Open the Ubuntu Software Center, select Edit -> Software Sources...
  2. In the Software Sources dialog, select the tab Other Software.
  3. There, select Add and type in the apt line for the package archive (viz. ppa:konrad-siek/ppa).
  4. Find the program in Get software, in a section like ppa that just appeared.
Another alternative is to do it the gritty way and create a list for apt. That you can do by creating a file in /etc/apt/sources.list.d/ and inside it type in deb http://ppa.launchpad.net/konrad-siek/ppa/ubuntu lucid main. You substitute lucid for whatever version of Ubuntu you happen to be running. Then you can install the program via whatever package management tool you want, e.g. apt:
sudo apt-get update && sudo apt-get install bigredbutton
Done.

Where's Mah Intertubes

By popular demand (of a single cheesecakemonger)

Have you ever signed up with a shoddy Internet provider, where you get sudden and frequent connection failures that prevent your router from, well, routing? Are you getting tired of trying to stare down your modem in wait for the Internet connection to get up again after one of these failures? Maybe you'd prefer there to be some kind of script that you could use to be instantly notified when the connection gets back on, so you can both instantly resume obsessively browsing ICHC and get out of the living room into the kitchen and maybe do some dishes as your provider insists that crappy service is just what you need?

Well, this is the script you are looking for. You can run it after your connection went down and it'll play some sound when the connection is back again. That'd be if you run it (using some awesome sound effects from Battle for Wesnoth) as:
./wheresmahintertubes.py \
--yay=/usr/share/games/wesnoth/1.8/data/core/music/defeat.ogg \
--boo=/usr/share/games/wesnoth/1.8/data/core/music/victory.ogg &

Hell, you can even put it into your startup applications in Gnome or something (I figure making it into an init.d start-up script would be a bit overkill). Simplest way I can think of to do that in Ubuntu would be to go to System -> Preferences -> Startup Applications, there press Add and put the same command as above into the field aptly named Command.

The script has zero innovative mechanics: it tries to connect to Google by IP every 30 seconds and then plays an appropriate bit of accidental music through PyGame's mixer.

I figured pinging Google like that is the simplest way to determine whether the connection is up or down. Sure, it's not perfect, but it'll work well enough with the benefit of working within 5 minutes of me starting to write the script. Anyway, the timeout, the address, and the delay between checks can all be configured using the friendly commandline interface (i.e. command switches: --timeout, --uri, --delay).
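Stripped of options and sounds, the whole mechanism is a check plus a loop that reacts only when the answer changes. A minimal sketch, written against Python 3's urllib.request rather than the urllib2 used in the script; the names is_online and watch, and the sleep and rounds parameters (there only so the loop can be tried out without waiting), are made up for this illustration.

```python
import time
from urllib.request import urlopen


def is_online(uri="http://74.125.77.147", timeout=10):
    """Return True if the URI answers within the timeout (in seconds)."""
    try:
        urlopen(uri, timeout=timeout).close()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return False
    return True


def watch(check, on_change, delay=30, sleep=time.sleep, rounds=None):
    """Call on_change(state) whenever check() starts giving a new answer."""
    connected = check()
    done = 0
    while rounds is None or done < rounds:
        sleep(delay)
        current = check()
        if current != connected:
            connected = current
            on_change(connected)
        done += 1
```

Running watch(is_online, print) would then print True or False each time the connection comes back or drops.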

Using PyGame is also a bit suspect in this application (overkill, mostly), and I did initially plan to use GStreamer. But PyGame takes 4 times less code on my end, and it neither requires suspiciously GTK-related libraries, nor mysteriously hangs up or disobeys orders... Indeed, it just works, and chances are you already have it installed on a lot of systems anyway, because you wanted to play Slingshot at some point...

So, here's the code. Enjoy.
 
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Where's Mah Intertubes?!
#
# A simple script that lets you know when your connection goes down or comes
# back up with sounds.
#
# Depends:
#   espeak (if you don't want to use sound files)
#   pygame (if you do)
#
# Options:
#   -h, --help            show this help message and exit
#   --speak               recite a message when connection goes on or off
#                         (default)
#   --off=AUDIO_OFF, --boo=AUDIO_OFF
#                         set a sound played when connection is lost
#   --on=AUDIO_ON, --yay=AUDIO_ON
#                         set a sound played when connection is back on
#   -t TIMEOUT, --timeout=TIMEOUT
#                         set timeout for checking if connection works (default:
#                         10s)
#   -d DELAY, --delay=DELAY
#                         set delay between connection checks (default: 30s)
#   -u URI, --uri=URI, --url=URI
#                         ping this URI to see if connection works (default:
#                         http://74.125.77.147)
#   -v, --verbose         display information about things done by the program
#
# Examples:
#   Let's say you want to check if you can connect and you're fine with the
#   espeak dude moaning about it instead of using a cool sound; you can use one
#   of the following (let it run in the background):
#
#       ./wheresmahintertubes.py &
#       ./wheresmahintertubes.py --speak &
#
#   If you want some proper fun sounds then all you need is point them out:
#
#       ./wheresmahintertubes.py --yay=file/for_on.ogg --boo=file/for_off.mp3 &
#
# License:
#   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
#
#   This program is free software: you can redistribute it and/or modify it
#   under the terms of the GNU General Public License version 3, as published
#   by the Free Software Foundation.
#
#   This program is distributed in the hope that it will be useful, but
#   WITHOUT ANY WARRANTY; without even the implied warranties of
#   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
#   PURPOSE. See the GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License along
#   with this program. If not, see <http://www.gnu.org/licenses/>.
#

import time
import sys
import pygame

quiet = False
uri = "http://74.125.77.147"

def printerr(*args):
    if quiet:
        return
    from sys import argv, stderr
    from os.path import basename
    stderr.write("%s:" % basename(argv[0]))
    for arg in args:
        stderr.write(" %s" % arg)
    stderr.write("\n")

def printout(*args):
    if quiet:
        return
    from sys import argv, stdout
    from os.path import basename
    stdout.write("%s:" % basename(argv[0]))
    for arg in args:
        stdout.write(" %s" % arg)
    stdout.write("\n")

def can_has_connection(timeout):
    import socket
    try:
        from urllib2 import urlopen, URLError    # Python 2
    except ImportError:
        from urllib.request import urlopen       # Python 3
        from urllib.error import URLError
    printout("Checking ping to", uri)
    try:
        urlopen(uri, timeout=timeout).close() # Google
    except (URLError, socket.timeout):
        # A timeout while reading is not always wrapped in a URLError.
        return False
    return True

class Speaker:
    def __init__(self):
        self.library = {}

    def add_to_library(self, key, path):
        self.library[key] = path

    def play_from_library(self, key):
        if not key in self.library:
            return False
        self.play(self.library[key])
        return True

    def play(self, message):
        from subprocess import call
        # Pass the message as a single argument, so that the shell does not
        # split it into separate words.
        call(['espeak', message])

class PyGamePlayer:
    def __init__(self):
        pygame.init()
        self.library = {}

    def add_to_library(self, key, path):
        self.library[key] = path

    def play_from_library(self, key):
        if not key in self.library:
            return False
        self.play(self.library[key])
        return True

    def play(self, path):
        pygame.mixer.Sound(path).play()

class Checker:
    def __init__(self, delay, timeout):
        self.delay = delay
        self.timeout = timeout
        # The player is attached from the outside before running.
        self.player = None

    def run(self):
        connected = can_has_connection(self.timeout)
        printout('Initially the connection is', 'on' if connected else 'off')
        while True:
            time.sleep(self.delay)
            current = can_has_connection(self.timeout)
            if connected != current:
                connected = current
                self.react(connected)

    def react(self, connected):
        from threading import Thread
        printout('The connection just went', 'on' if connected else 'off')
        # Play the sound in a separate thread, so that checking can go on.
        thread = Thread(target=self.player.play_from_library,
                        args=(connected,))
        thread.start()

if __name__ == '__main__':
    from optparse import OptionParser
    from os.path import basename
    from sys import argv

    usage = ('\n%s [OPTIONS] ' % basename(argv[0]) +
        '--on=[SOUND FILE] --off=[SOUND FILE]\n' +
        '\tplay a sound when network connection goes up or down' +
        '\n%s [OPTIONS] ' % basename(argv[0]) + '--speak\n' +
        '\trecite a message when network connection goes up or down (boring...)')

    description = ('Wait around and periodically check if the connection '
        'went up or down, and if that happens play an appropriate sound to '
        'indicate it to the user. Network connectivity is checked by the '
        'simple method of connecting to a specific host address, and '
        'assuming that the network is down if it takes too much time for '
        'that host to respond.')

    parser = OptionParser(usage=usage, description=description)

    parser.add_option('--speak', action='store_true', dest='speak',
        help='recite a message when connection goes on or off (default)')
    parser.add_option('--off', '--boo', action='store', dest='audio_off',
        help='set a sound played when connection is lost')
    parser.add_option('--on', '--yay', action='store', dest='audio_on',
        help='set a sound played when connection is back on')
    parser.add_option('-t', '--timeout', action='store', dest='timeout',
        help='set timeout for checking if connection works (default: 10s)',
        default=10, type='int')
    parser.add_option('-d', '--delay', action='store', dest='delay',
        help='set delay between connection checks (default: 30s)',
        default=30, type='int')
    parser.add_option('-u', '--uri', '--url', action='store', dest='uri',
        help='ping this URI to see if connection works (default: %s)' % uri,
        default=uri)
    parser.add_option('-v', '--verbose', action='store_true', dest='verbose',
        help='display information about things done by the program')

    opts, args = parser.parse_args()

    quiet = not opts.verbose
    uri = opts.uri

    player = None

    # Fall back to speech unless both sound files were given.
    if opts.speak or not opts.audio_on or not opts.audio_off:
        player = Speaker()
        player.add_to_library(True, "Connection just went up")
        player.add_to_library(False, "Connection just went down")
    else:
        player = PyGamePlayer()
        player.add_to_library(True, opts.audio_on)
        player.add_to_library(False, opts.audio_off)

    checker = Checker(delay=opts.delay, timeout=opts.timeout)
    checker.player = player

    try:
        checker.run()
    except (KeyboardInterrupt, SystemExit):
        printout('Exiting...')

The code is also available on GitHub at python/wheresmahintertubes.py.

Wednesday, July 7, 2010

OCD

OCD - the obsessive-compulsive daemon for checking websites for changes...

Have you ever waited for results of an exam or something that were meant to appear on a website somewhere? This script is supposed to spare you pressing CTRL+R every minute, by downloading the page every so often and comparing it with some prior version.

I actually don't do that sort of thing much anymore, but I'd seen some people (well, students) rapidly refreshing recently and felt that I could try to save them from themselves.

Anyway, onto the technical stuff.

Whereas doing that in a bash script using wget and whatnot will probably take around 20 lines or so, I really wanted to add support for Basic authentication and for selecting just parts of the website for comparison with XPath. The first of these can be done with wget but the latter requires a bit more work in bash in my experience. Also, I figured if I did it in Python maybe it'd be a bit more portable, in case somebody actually wants to use the script but for some strange reason wants to use that icky bashless operating system (and cannot install cygwin as well).

The modus operandi is rather simple: first, the original version is downloaded (line 215), then a giant while-true loop (line 224) downloads a new version (line 230) every so-and-so seconds (line 226). Depending on whether it's supposed to use XPath or not, the download will use a different function (the one on line 104 or the one on line 120) and will return different types of objects (text or a list, but this is mostly unimportant). The download functions either use straightforward urllib2 to open a resource directly, or build an opener with a username and password, if these are defined. Everything else is pretty much decoration... you know - parameter parsing, communication with the user.
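In miniature, those two ingredients look roughly like this. The sketch uses Python 3's urllib.request, the successor of urllib2, and the names make_opener and wait_for_change, as well as the sleep parameter (there so the loop can be tried without waiting), are made up for illustration and are not taken from the script.

```python
import time
import urllib.request


def make_opener(uri, user=None, password=None):
    """Build a URL opener, with Basic authentication if a user is given."""
    if user is None:
        return urllib.request.build_opener()
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, uri, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


def wait_for_change(download, sleep_time, sleep=time.sleep):
    """Poll download() until its result differs from the first version."""
    original = download()
    while True:
        sleep(sleep_time)
        current = download()
        if current != original:
            return original, current
```

A download function for a plain page could be as simple as lambda: opener.open(uri).read(), with opener coming from make_opener.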

And then, there's the code:


1 #!/usr/bin/env python
2 # -*- coding: utf-8 -*-
3 #
4 # Obsessive-compulsive daemon.
5 #
6 # The obsessive-compulsive daemon for website checking. It checks the website
7 # religiously every three minutes so you don't have to! Never again do you have
8 # to toil in front of the monitor for a week after that tricky exam just pressing
9 # 'Refresh': employ our machine slaves to do it for you! No hassle*!
10 #
11 # The script is supposed to run in the background and download the webpage under
12 # scrutiny with a given frequency and compare it with the version it downloaded
13 # the first time. If a change is detected, the script runs a custom command.
14 #
15 # * Unless they organize a bloody revolution to overthrow their meaty
16 # oppressors... and rightfully so.
17 #
18 # Depends:
19 # µTidylib <apt:python-utidylib>
20 #
21 # Usage:
22 # ocd [OPTIONS] URI
23 #
24 # Options:
25 # -c COMMAND, --command=COMMAND
26 # Specify a command to run on change.
27 # -f FORMAT, --format=FORMAT
28 # Specify how arguments are passed to the command.
29 # Available placeholders: %uri - the URI of the observed
30 # website, %old - original content, %new - changed
31 # content.
32 # -x XPATH, --xpath=XPATH
33 # Provide a path to interesting elements in the observed
34 # document.
35 # -s SECONDS, --sleep-time=SECONDS
36 # Set the time in seconds between checks (downloads).
37 # --continue Do not stop when change is detected. Instead run
38 # specified command and continue checking.
39 # -u USER, --user=USER
40 # Specify a username for the website.
41 # -P PASSWORD, --pass=PASSWORD, --password=PASSWORD
42 # Give a password for the website.
43 # -p, --prompt Prompt for login and password.
44 # -v, --verbose Print script progress information to stderr.
45 # -h, --help Show detailed usage information.
46 #
47 # Examples:
48 # The simplest use-case: watch for changes in a webpage (as a whole) and check
49 # it every minute. The ampersand will cause it to be run in the background in
50 # bash.
51 #
52 # ./ocd.py http://www.example.com/ &
53 #
54 # Debugging and such: same as above, but loudmouth mode.
55 #
56 # ./ocd.py -v http://www.example.com/ &
57 #
58 # Watch a page for changes in a specific table element, with a login and
59 # password, and checking it every 20 seconds.
60 #
61 # ./ocd.py -ps 20 -x "//td[@class='strong']/div/span" \
62 # "https://usosweb.amu.edu.pl/kontroler.php?_action=actionx:dla_stud/studia/oceny/index()" &
63 #
64 # Run a specific command on change: here, speak that the URI changed.
65 #
66 # ./ocd.py -c espeak -f "Changes found in resource: %uri" \
67 # http://www.cs.put.poznan.pl/ksiek/sop/resources/embrace_change.php &
68 #
69 # Here's a hint: if you run something in the commandline in the background
70 # using the ampersand you can then turn the terminal off by pressing CTRL + D.
71 #
72 # License:
73 # Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
74 #
75 # This program is free software: you can redistribute it and/or modify it
76 # under the terms of the GNU General Public License version 3, as published
77 # by the Free Software Foundation.
78 #
79 # This program is distributed in the hope that it will be useful, but
80 # WITHOUT ANY WARRANTY; without even the implied warranties of
81 # MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
82 # PURPOSE. See the GNU General Public License for more details.
83 #
84 # You should have received a copy of the GNU General Public License along
85 # with this program. If not, see <http://www.gnu.org/licenses/>.
86 #
87
88 # Internationalization.
89 import gettext
90 from gettext import gettext as _
91 gettext.textdomain('ocd')
92
93 def _ssl_opener(uri, user, password):
94     """ Prepare an open function to handle the specific URI with Basic
95     authentication with a user/password."""
96     import urllib2  # download() only imports urllib2 in its own scope
97     password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
98     password_manager.add_password(None, uri, user, password)
99     handler = urllib2.HTTPBasicAuthHandler(password_manager)
100     opener = urllib2.build_opener(handler)
101
102     return opener.open
103
104 def download(uri, user=None, password=None):
105     """ Downloads the specified document and cleans it up."""
106
107     import urllib2, tidy
108     from StringIO import StringIO
109
110     opener = urllib2.urlopen if None in [user, password] \
111         else _ssl_opener(uri, user, password)
112     raw_resource = ''.join(opener(uri).readlines())
113     tidy_doc = tidy.parseString(raw_resource, output_xhtml=1, add_xml_decl=1,
114         indent=1, output_encoding='utf8')
115     resource = StringIO()
116     tidy_doc.write(resource)
117
118     return resource.getvalue()
119
120 def downloadx(uri, xpath, user=None, password=None, getcontent=True):
121     """ Downloads the specified xpath elements from the document.
122
123     If getcontent is set to True, the element list will be converted to string
124     before returning, otherwise a list of xmlNode objects is returned."""
125
126     import libxml2
127
128     resource = download(uri, user, password)  # pass the credentials along
129     document = libxml2.htmlParseDoc(resource, None)
130     document.xpathNewContext()
131     elements = document.xpathEval(xpath)
132
133     if getcontent:
134         return map(lambda e: e.get_content(), elements)
135     else:
136         return elements
137
138 def getcredentials(user=None):
139     """ Prompt the user for whatever credentials are still missing.
140
141     If the username is given then just a prompt for password is shown to the
142     user, and otherwise, the user is asked for both the password and username."""
143
144     from sys import stderr, stdin
145     from getpass import getpass
146
147     if user is None:
148         stderr.write(_('Username: '))
149         user = stdin.readline().strip()
150     password = getpass(_('Password: '), stream=stderr)
151
152     return user, password
153
154 def _print(string, verbose):
155     """ A shorthand to print debugging information out if the verbose option is
156     set to True. This particular implementation is a bit costly."""
157
158     if verbose:
159         from sys import stderr, argv
160         from os.path import basename
161         stderr.write("%s: %s\n" % (basename(argv[0]), string))
162
163 def run(uri, effect, user=None, password=None, verbose=False, xpath=None,
164         sleeptime=60, stopondifference=True, comparator=lambda x, y: x == y,
165         prompt=False):
166     """ The main part of the script: runs comparisons in a loop.
167
168     Here, credentials are gathered, the original version of the observed
169     resource is downloaded, and the script sleeps for the specified time,
170     downloads and compares new versions of the resource and checks for changes.
171
172     The URI specifies the address of the resource, or webpage, or whatever, that
173     will be observed to find changes.
174
175     The effect parameter is a function that will be run when a change is found.
176     This should be a function that takes three arguments: URI, old content, and
177     new content. The old content and new content arguments may be of type List
178     (if xpath is used) or String (if it is not used).
179
180     Sleep time may be specified in seconds, controlling the time the loop will
181     wait between downloading each new (potentially changed) version of the page.
182
183     An XPath query can be given that specifies the elements that should be
184     compared with version changes instead of a whole page. (Information on
185     XPath: http://www.w3schools.com/XPath/.)
186
187     If verbose is set, debugging messages are produced on stderr.
188
189     If stopondifference is set, the script will stop running if a change is
190     found, otherwise, the changed version becomes the new original and checking
191     is continued.
192
193     If prompt is set, the user is asked for password and login if necessary;
194     otherwise the login and password are used as they are.
195
196     A custom comparator may be specified. The comparator takes two arguments of
197     type list (if xpath is used) or string (if it isn't) and should return a
198     boolean."""
199
200     from time import sleep
201
202     # Ask the user for password if necessary.
203     if prompt:
204         user, password = getcredentials(user)  # keep what the user typed
205
206     # Prepare a shorthand function for downloading versions.
207     def _retrieve():
208         if xpath:
209             return downloadx(uri, xpath, user, password)
210         else:
211             return download(uri, user, password)
212
213     # Download the original version of the remote resource.
214     _print(_("Downloading base version of resource %s.") % uri, verbose)
215     comparison = _retrieve()
216
217     print comparison
218
219     if verbose and xpath:
220         # Just print some stuff out to stderr.
221         elements = ', '.join(map(lambda x: '"%s"' % x, comparison))
222         _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
223
224     while True:
225         _print(_("Sleeping %s seconds.") % sleeptime, verbose)
226         sleep(sleeptime)
227
228         # Grab a more current version of the resource.
229         _print(_("Downloading resource %s.") % uri, verbose)
230         current = _retrieve()
231
232         if verbose and xpath:
233             # Just print some stuff out to stderr.
234             elements = ', '.join(map(lambda x: '"%s"' % x, current))
235             _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
236
237         # Compare old and new version of the resource.
238         if not comparator(comparison, current):
239             # Run
240             _print(_("Resource changed!"), verbose)
241             effect(uri, comparison, current)
242
243             if stopondifference:
244                 break
245             else:
246                 comparison = current
247                 continue
248
249         _print(_("No changes."), verbose)
250
251 def create_handler_command(name, format):
252     """ Create a function that wraps a command for use with the main loop."""
253
254     from os import popen
255     from sys import stdout
256
257     def run_command(uri, old, new):
258         formatted = format
259         formatted = formatted.replace("%uri", str(uri))
260         formatted = formatted.replace("%old", str(old))
261         formatted = formatted.replace("%new", str(new))
262         command = "%s %s" % (name, formatted)
263         content = ''.join(popen(command).readlines())
264         stdout.write(content)
265         return content
266
267     return run_command
268
269 if __name__ == '__main__':
270     """ Parse commandline options and start checking for changes."""
271
272     from optparse import OptionParser, OptionGroup
273     from sys import argv, exit
274     from os.path import basename
275
276     # Prepare the parser.
277     usage = '%s [OPTIONS] URI' % basename(argv[0])
278     parser = OptionParser(usage=usage)
279
280     # Options that control the process of checking the website.
281     querying = OptionGroup(parser, 'Querying options')
282     querying.add_option('-c', '--command', metavar='COMMAND', dest='command',
283         default='echo', help='Specify a command to run on change.')
284     querying.add_option('-f', '--format', metavar='FORMAT', dest='format',
285         help='Specify how arguments are passed to the command. ' +
286         'Available placeholders: ' + '%uri - the URI of the observed website, '
287         '%old - original content, ' + '%new - changed content.', default='%uri')
288     querying.add_option('-x', '--xpath', metavar='XPATH', dest='xpath',
289         help='Provide a path to interesting elements in the observed document.',
290         default=None)
291     querying.add_option('-s', '--sleep-time', metavar='SECONDS', dest='sleep',
292         help='Set the time in seconds between checks (downloads).', default=60,
293         type='float')
294     querying.add_option('--continue', action='store_true',
295         default=False, help='Do not stop when change is detected. Instead ' +
296         'run specified command and continue checking.', dest='notstop')
297     parser.add_option_group(querying)
298
299     # SSL and authentication options.
300     security = OptionGroup(parser, 'Security options')
301     security.add_option('-u', '--user', metavar='USER', dest="user", \
302         default=None, help='Specify a username for the website.')
303     security.add_option('-P', '--pass', '--password', metavar='PASSWORD', \
304         dest="password", default=None, help='Give a password for the website.')
305     security.add_option('-p', '--prompt', dest='prompt', action="store_true",
306         default=False, help='Prompt for login and password.')
307     parser.add_option_group(security)
308
309     # Options that don't fit into other categories.
310     other = OptionGroup(parser, "Other options")
311     other.add_option('-v', '--verbose', dest='verbose', action="store_true",
312         default=False, help='Print script progress information to stderr.')
313     parser.add_option_group(other)
314
315     opts, args = parser.parse_args()
316
317     # Check arguments
318     if len(args) < 1:
319         _print(_("Nothing to check: quitting..."), opts.verbose)
320         exit(0)
321
322     if len(args) > 1:
323         arguments = ', '.join(args[1:])
324         _print(_("Arguments %s are ignored.") % arguments, opts.verbose)
325
326     # Let us begin to commence!
327     try:
328         run(
329             args[0], create_handler_command(opts.command, opts.format),
330             user=opts.user, password=opts.password, prompt=opts.prompt,
331             verbose=opts.verbose, sleeptime=opts.sleep,
332             stopondifference=(not opts.notstop), xpath=opts.xpath
333         )
334     except (KeyboardInterrupt, SystemExit):
335         _print(_("Closed by the user."), opts.verbose)
336

The code is also available on GitHub at python/ocd.py.