Gritty Scripts of Death: 2010

Monday, September 27, 2010

Duplicates

I suspect I might have a number of heavy files on all my disks that are the exact same file, which I had probably copied during some re-installation as backup or something and subsequently forgot. However, I am much too lazy to go through all my disks and actually check if this is the case, and even less inclined to clean it all up...

So I figured I'd make this script which does it for me, right? It took me a couple of days, but writing scripts is much more fun than cleaning up... Also I had to sit on this train for 3 hours, twice, and then some more at the station, and then in a hotel room. So it was only natural to write something.

It works like this: it looks through all the files in a given directory (recursively) and then for each pair of files it checks whether their contents are the same. Actually, first the contents are read in and an MD5 hash is made, then the hashes are compared, and only if those are equal do I compare the files bit-by-bit. If they are equal too, then I have found a duplicate and can delete one!
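The hash-then-compare idea can be condensed into a short sketch. This is not the script below, just a minimal illustration that assumes Python 3 and swaps in two standard-library helpers: hashlib.md5 fed in chunks (so big files need not fit in memory) and filecmp.cmp with shallow=False for the final content check. The names file_hash and find_duplicates are made up for this example, and unreadable files are not handled.

```python
import filecmp
import hashlib
import os


def file_hash(path, chunk_size=1 << 20):
    """Return the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as source:
        for chunk in iter(lambda: source.read(chunk_size), b''):
            digest.update(chunk)
    return digest.digest()


def find_duplicates(root):
    """Return a list of groups (lists) of paths with identical contents."""
    by_hash = {}  # digest -> list of groups sharing that digest
    for directory, _, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            groups = by_hash.setdefault(file_hash(path), [])
            for group in groups:
                # Hashes match: confirm by comparing the actual contents.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                # Either a new hash or a genuine hash collision.
                groups.append([path])
    return [group for groups in by_hash.values()
            for group in groups if len(group) > 1]
```

Calling find_duplicates on a directory returns only the groups with more than one member, so an empty list means no duplicates were found.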

After implementing the script in a sort of rough-and-ready fashion I figured out that some logging would be useful, since it was annoying to wait a long time for the script to terminate just to find some weird garbled results and not know why, so I implemented that. Later on, after I ran it a couple more times, I decided that some mechanism for excluding files would also be nice. So I added some options to do just that, either by specifying exact paths or by writing some regular expressions. And then I thought that if I really wanted to make a complicated filter then maybe it'd be a good idea to be able to create the path list externally and then pipe it to the script... so I did that too. And the script kind of grew a lot. And I had to add all those comments and stuff too.
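To illustrate the external-filter idea, a separate helper could print one path per line and be piped into the script's --stdin option. The helper below is hypothetical (big_files and the 1 MiB threshold are invented for the example) and assumes Python 3.

```python
import os
import sys


def big_files(root, min_size=1 << 20):
    """Yield absolute paths of regular files of at least min_size bytes."""
    for directory, _, names in os.walk(root):
        for name in names:
            path = os.path.abspath(os.path.join(directory, name))
            try:
                if os.path.isfile(path) and os.path.getsize(path) >= min_size:
                    yield path
            except OSError:
                # Unreadable or vanished file: skip it.
                continue


# Does nothing unless a directory is given on the command line.
if __name__ == '__main__' and len(sys.argv) > 1:
    # One path per line, which is the format read from standard input.
    for path in big_files(sys.argv[1]):
        print(path)
```

Saved as, say, bigfiles.py, it could be used like this: python3 bigfiles.py /home/steve | ./duplicates.py --stdin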

Enough banter, the code:

 
  1 #!/usr/bin/env python 
  2 # -*- coding: utf-8 -*- 
  3 # 
  4 # Duplicates 
  5 # 
  6 # A quick and simple script to find files in a directory which have the same 
  7 # contents as one another. A hash of each files' contents is created and  
  8 # compared against one another to find identical files. When hashes match the 
  9 # files' contents are compared bit-by-bit. The script then prints out groups of 
 10 # files which have the same contents. 
 11 # 
 12 # Options: 
 13 #   -h, --help            show this help message and exit 
 14 #   --paragraphs          Print out final results as paragraphs, where each line 
 15 #                         is filename, and each group of identical files is 
 16 #                         separated from another by an empty line. 
 17 #   -f FIELD, --field-separator=FIELD 
 18 #                         Print out identical files separated from one another 
 19 #                         by the specified string. Uses system path separator by 
 20 #                         default. 
 21 #   -g GROUP, --group-separator=GROUP 
 22 #                         Print out groups of identical files separated from one 
 23 #                         another by the specified string. Uses new lines by 
 24 #                         default. 
 25 #   -v, --verbose         Show more diagnostic messages (none - only errors and 
 26 #                         final results, once [-v] - duplicate messages, twice 
 27 #                         [-vv] - matching hash messages, four times [-vvvv] - 
 28 #                         all possible diagnostic messages. 
 29 #   --hash-only           Do not compare duplicate files bit-by-bit if hashes 
 30 #                         match 
 31 #   --non-recursive       Only look through the files in the directory but do 
 32 #                         not descend into subdirectories. 
 33 #   -e EXCLUDES, --exclude=EXCLUDES 
 34 #                         Do not search through the files described by this 
 35 #                         path. 
 36 #   -r REGEXPS, --exclude-regexp=REGEXPS 
 37 #                         Do not search through the files whose paths fit this 
 38 #                         regular expression. (Details on regular expressions: 
 39 #                         http://docs.python.org/library/re.html) 
 40 #   -s, --stdin           Read list of paths from standard input (arguments are 
 41 #                         ignored) 
 42 # 
 43 # Example: 
 44 #   This is how you go about checking if Steve has any duplicated files in his 
 45 #   home directory: 
 46 #       ./duplicates.py /home/steve 
 47 # 
 48 # License: 
 49 #   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com> 
 50 # 
 51 #   This program is free software: you can redistribute it and/or modify it  
 52 #   under the terms of the GNU General Public License version 3, as published  
 53 #   by the Free Software Foundation. 
 54 #  
 55 #   This program is distributed in the hope that it will be useful, but  
 56 #   WITHOUT ANY WARRANTY; without even the implied warranties of  
 57 #   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR  
 58 #   PURPOSE.  See the GNU General Public License for more details. 
 59 #  
 60 #   You should have received a copy of the GNU General Public License along  
 61 #   with this program.  If not, see <http://www.gnu.org/licenses/>. 
 62 # 
 63  
 64 import os, sys, re
 65  
 66 # Levels of verbosity: 
 67 #   * results - print out the final results formatted as specified by the user, 
 68 #   * errors - show final results and messages from any errors that occur, 
 69 #   * duplicate - print out a message every time a duplicate is found, 
 70 #   * hash - print out an information every time two hashes match 
 71 #   * all - show all diagnostic messages possible (a lot of text, this) 
 72 SHOW_RESULTS, SHOW_ERRORS, SHOW_DUPLICATE, SHOW_HASH, SHOW_ALL = range(-1,4)
 73  
 74 # The selected level of verbosity will be stored here. 
 75 verbosity = 0 # SHOW_ERRORS until the options are parsed 
 76  
 77 def printerr(level, *args):
 78     """ Print an error message if the specified level of verbosity allow it.""" 
 79     if level > verbosity:
 80         return 
 81     from sys import argv, stderr
 82     from os.path import basename
 83     stderr.write("%s:" % basename(argv[0]))
 84     for arg in args:
 85         stderr.write(" %s" % arg)
 86     stderr.write("\n")
 87  
 88 def listall(root, recursive=True, excludes=[]):
 89     """ Traverse a file tree and list all files therein.""" 
 90     from os import listdir
 91     from os.path import isdir, abspath, exists, join
 92     dir_filter = lambda f: not isdir(f)
 93     files = []
 94     todo = [abspath(root)]
 95     while todo:
 96         path = todo.pop()
 97         # Check if the file is in the exclusion list, and if so, do not  
 98         # process it further. 
 99         if matches(excludes, path):
100             printerr(SHOW_ALL, 'Path excluded from comparisons', "'%s'" % path)
101             continue 
102         # In case any errors occur just print the message but do not stop  
103         # working: results will be less exact, but at least there will be some. 
104         try:
105             printerr(SHOW_ALL, 'Found file:', "'%s'" % path)
106             # Ordinary files go onto the file list and will be checked for  
107             # duplicates.  
108             if not isdir(path):
109                 files.append(path)
110                 continue 
111             # Directories are listed and their contents are put back onto the 
112             # todo list, while they themselves will not be checked for  
113             # duplicates. 
114             contents = [join(path, file) for file in listdir(path)]
115             todo += contents if recursive else filter(dir_filter, contents)
116         except Exception as exception:
117             printerr(SHOW_ERRORS, exception)
118     return files
119  
120 def same_file(data_a, data_b):
121     """ Compare the contents of two files bit by bit.""" 
122     len_a = len(data_a)
123     len_b = len(data_b)
124     if len_a !=  len_b:
125         return False 
126     for i in range(0, len_a):
127         if data_a[i] != data_b[i]:
128             return False 
129     return True 
130  
131 def matches(excludes, path):
132     """ Check if the given path is in the exclusion list, which consists of  
133     strings and compiled regular expressions.""" 
134     for expression in excludes:
135         if type(expression) == str:
136             if path == expression:
137                 return True 
138         else:
139             if expression.match(path):
140                 return True 
141     return False 
142  
143 def read_data(path):
144     """ Read contents of a given file and close the stream.""" 
145     data_source = open(path, 'rb')
146     data = data_source.read()
147     data_source.close()
148     return data
149  
150 def duplicates(paths, onlyhashes=False, excludes=[]):
151     """ For each file in a list of files find its duplicates in that list. A  
152     duplicate of file is such that has the same contents. The files are compared 
153     first by hashes of its contents and if those match, bit by bit (although the 
154     latter can be turned off for a performance increase).""" 
155     from hashlib import md5
156     hashes = {}
157     duplicates = []
158     for path in paths:
159         printerr(SHOW_ALL, 'Looking for duplicates for', "'%s'" % path)
160         try:
161             data = read_data(path)
162             hash = md5(data).digest()
163             if hash in hashes:
164                 other_paths = hashes[hash]
165                 duplicated = False 
166                 for other_path in other_paths:
167                     # If only hashes are supposed to be taken into account,  
168                     # then assume this file is a duplicate and do not process  
169                     # further. 
170                     if onlyhashes:
171                         duplicates.append((other_path, path))
172                         duplicated = True 
173                         break 
174                     other_data = read_data(other_path)
175                     # Check if files are different despite having the same hash. 
176                     if same_file(data, other_data):
177                         printerr(SHOW_DUPLICATE, 'Found duplicates:',
178                             "'%s'" % path, 'and', "'%s'" % other_path)
179                         duplicates.append((other_path, path))
180                         duplicated = True 
181                 if not duplicated:
182                     # Same hash but different content. 
183                     printerr(SHOW_HASH, 'No duplicate found for', "'%s'" % path)
184                     hashes[hash].append(path)
185             else:
186                 # No matching hash. 
187                 printerr(SHOW_ALL, 'No duplicate found for', "'%s'" % path)
188                 hashes[hash] = [path]
189         except Exception as exception:
190             printerr(SHOW_ERRORS, exception)
191     return duplicates
192  
193 def sort(duplicates):
194     """ Organize pairs of duplicates into groups (sets).""" 
195     sorts = []
196     for duplicate_a, duplicate_b in duplicates:
197         for sort in sorts:
198             if duplicate_a in sort or duplicate_b in sort:
199                 sort.add(duplicate_a)
200                 sort.add(duplicate_b)
201                 break 
202         else:
203             sorts.append(set([duplicate_a, duplicate_b]))
204     return sorts
205  
206 def print_results(sorts, separator=os.pathsep, group_separator="\n"):
207     """ Print out sets of results, where each element of a set is one field, 
208     separated from others by a field separator, and each set is a record or  
209     group, separated from other groups by a group separator.""" 
210  
211     from sys import stdout
212     for sort in sorts:
213         first = True 
214         for s in sort:
215             if not first:
216                 stdout.write(separator)
217             stdout.write(s)
218             first = False 
219         stdout.write(group_separator)
220  
221 if __name__ == '__main__':
222     """ The main function: argument handling and all processing start here.""" 
223  
224     from optparse import OptionParser
225     from os.path import basename
226     from sys import argv
227  
228     # Prepare user options. 
229     usage = '\n%s [OPTIONS] PATH_LIST ' % basename(argv[0])
230  
231     description = 'Looks through the specified directory or directories ' + \
232         'for duplicated files. Files are compared primarily by a hash ' + \
233         'created from their contents, and if there\'s a hit, they are ' + \
234         'compared bit-by-bit to ensure correctness.'
235 
236     parser = OptionParser(usage=usage, description=description)
237 
238     parser.add_option('--paragraphs', action='store_true', dest='paragraphs',
239         help='Print out final results as paragraphs, where each line is ' +
240         'filename, and each group of identical files is separated from ' +
241         'another by an empty line.', default=False)
242     parser.add_option('-f', '--field-separator', action='store', dest='field',
243         help='Print out identical files separated from one another by the ' +
244         'specified string. Uses system path separator by default.',
245         default=os.pathsep)
246     parser.add_option('-g', '--group-separator', action='store', dest='group',
247         help='Print out groups of identical files separated from one ' +
248         'another by the specified string. Uses new lines by default.',
249         default='\n')
250     parser.add_option('-v', '--verbose', action='count', dest='verbosity',
251         default=0, help='Show more diagnostic messages (none - only ' +
252         'errors and final results, once [-v] - duplicate messages, twice ' +
253         '[-vv] - matching hash messages, four times [-vvvv] - all messages).')
254     parser.add_option('--hash-only', action='store_true', dest='hashonly',
255         help='Do not compare duplicate files bit-by-bit if hashes match',
256         default=False)
257     parser.add_option('--non-recursive', action='store_false',
258         help='Only look through the files in the directory but do not ' +
259         'descend into subdirectories.', default=True, dest='recursive')
260     parser.add_option('-e', '--exclude', action='append', dest='excludes',
261         help='Do not search through the files described by this path.',
262         default=[])
263     parser.add_option('-r', '--exclude-regexp', action='append',
264         dest='regexps', help='Do not search through the files whose paths ' +
265         'fit this regular expression. (Details on regular expressions: ' +
266         'http://docs.python.org/library/re.html)', default=[])
267     parser.add_option('-s', '--stdin', action='store_true', dest='stdin',
268         help='Read list of paths from standard input (arguments are ignored)',
269         default=False)
270  
271     # Gathering option information. 
272     opts, args = parser.parse_args()
273     if opts.paragraphs:
274         opts.field = '\n' 
275         opts.group = '\n\n' 
276     verbosity = opts.verbosity
277  
278     # Compiling excluding regular expressions. 
279     for regexp in opts.regexps:
280         matcher = re.compile(regexp)
281         opts.excludes.append(matcher)
282  
283     files = []
284     if opts.stdin:
285         # User provides paths by standard input, script ignores arguments. 
286         from sys import stdin
287         from os.path import exists, abspath
288         printerr(SHOW_ALL, 'Reading file paths from standard input')
289         for line in stdin.readlines():
290             line = line.rstrip('\n') # get rid of the trailing new line 
291             if exists(line):
292                 files.append(abspath(line))
293                 continue 
294             elif line == '':
295                 continue 
296             printerr(SHOW_ERRORS, 'File not found', "'%s'," % line, 'skipping')
297     else:
298         # Get file paths by parsing all arguments' file subtrees. 
299         if not args:
300             parser.print_help()
301             sys.exit(1)
302         for arg in args:
303             printerr(SHOW_ALL, 'Reading file tree under %s%s'
304                 % (arg, ' recursively' if opts.recursive else ''))
305             files += listall(arg, opts.recursive, opts.excludes)
306  
307     # Processing. 
308     sorts = sort(duplicates(files, opts.hashonly))
309     print_results(sorts, separator=opts.field, group_separator=opts.group)
310

The code is also available on GitHub at python/duplicates.py.

Thursday, September 16, 2010

Big Red Button

(I seem to have been carried away a little, but you can always skip the intro.)

If you've ever tried fiddling with some electronics on an amateurish level you'd probably thought at some point that it would be awesome if you could plug something into your computer and make it obey your evil commands. And then you'd laugh your evil mad laugh...

Or is it just me?

Anyway, it's a bugger to do that sort of thing with USB, especially if you're only learning which end of the soldering iron is the bit you hold and which is the bit you hold when you possibly want a day off and some attention from medical personnel. Right, anyway... So USB is a pain because to even turn a simple diode on or off you can't just plug it into the USB socket or anything because it's a serial port, so it has a complicated protocol to follow before it spits out any actual information. So you'd probably need to interface with the socket via an FT232R integrated circuit or something, but that's fiddly to hook up even in the best of times, and not as cheap as you'd like it to be.

But if you happen to find a computer which has a parallel port (you know, that pink one which you used to use for the printer) then you're all set, because that's not a serial port. No. It is a parallel port, as the name implies. In fact you can plug some 8 diodes and things in just like that and they will receive a nice 5V signal (Hey! Good for TTLs!) when you program the computer to do so. And that's what I'd done and other people have done as well.

Ok now... so how do I control the damn port? Well, you can put together a C program that'll do that for you, or, these days, a Python program, which is fine and dandy for us weird no-life developer nerds, but is definitely not going to work when you make a device for your dad. Back in the day, on those ugly, gray Windows 95 machines we had someone's little window application though, written in Delphi (that awful thing), but that is no good for a modern, up-to-date Linux user (also, I can't find the link anymore). And I couldn't really find anything worthwhile for the one true OS...

So something had to be done. Luckily I had some plane trips when I would be motivated (by way of having no alternative occupation except staring blankly at the seat in front of me) to hack up a little GTK application to serve my own needs, scratch an itch, that sort of thing...

(End of lengthy intro.)

Long story short, I wrote it in Python, using Quickly. I wanted to give it a simple interface, so I figured an enormous big red button would be the best way to go. True, you can't readily turn on only one bit of the port, but you can't have everything, and this way it won't confuse the people who might not be in the habit of sitting up late at night in the light of a computer screen and wondering if there's a 24/7 pizza place anywhere within walking distance.

Anyway, take a look at this here screenshot and tell me it's not at least amusing...

Feature number two was the ability to turn it on with a delay. So, say you have a device that you turn on, but you want it to turn on in an hour? Right-clickety on the button, set up the timer and you're good to go.

And finally, you can set it up so that it turns particular pins on or off when you press the button, so you can for instance keep one device turned on in the on position and turn on two other ones, but turn the first one off in the off position.
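To make the pin business a bit more concrete: the eight data pins of the port map onto the bits of one byte, so each button position boils down to one number. The sketch below is not taken from bigredbutton; the pin lists are a made-up configuration, pins_to_byte and write_to_port are names invented here, and the hardware part assumes the third-party pyparallel module, which the program itself may or may not use.

```python
def pins_to_byte(pins):
    """Combine data pin numbers (0-7) into the byte written to the port."""
    value = 0
    for pin in pins:
        if not 0 <= pin <= 7:
            raise ValueError('data pins are numbered 0 to 7, got %r' % pin)
        value |= 1 << pin
    return value


# Hypothetical configuration: which pins are high in each button position.
PINS_WHEN_ON = [0, 1, 2]   # first device plus two more
PINS_WHEN_OFF = [1, 2]     # first device switched off, the other two stay on


def write_to_port(value):
    """Send the byte to the first parallel port (needs pyparallel)."""
    import parallel  # third-party pyparallel; an assumption, not from the post
    port = parallel.Parallel()
    port.setData(value)
```

Pressing the button would then amount to something like write_to_port(pins_to_byte(PINS_WHEN_ON)), which needs an actual port and enough permissions to use it.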

It can also run with or without GUI, which you can find out about by running it as a script as:

bigredbutton -h

I'm not posting the code here, because it's on the long side, but it is available at GitHub as bigredbutton/ and, as previously stated, at https://launchpad.net/~konrad-siek/+archive/ppa.

Oh, and I'm sure it's all buggy as all hell, 'cuz over time I keep forgetting which bits I tested and fixed and which worked already. But it's on launchpad, so I'm sure I can fix them as they appear. Hell, maybe I'll even add a feature or two in the fullness of time.

How to install (on Ubuntu): You can...

Open the Ubuntu Software Center, select Edit -> Software Sources...
In the Software Sources dialog select the tab Other Software
There, select add and type in the apt line for the package archive (viz. ppa:konrad-siek/ppa);
Find the program in Get software, in a section like ppa that just appeared.

Another alternative is to do it the gritty way and create a list for apt. That you can do by creating a file in /etc/apt/sources.list.d/ and inside it type in deb http://ppa.launchpad.net/konrad-siek/ppa/ubuntu lucid main. You substitute lucid for whatever version of Ubuntu you happen to be running. Then you can install the program via whatever package management tool you want, e.g. apt:

sudo apt-get update && sudo apt-get install bigredbutton

Done.

Where's Mah Intertubes

By popular demand (of a single cheesecakemonger)

Have you ever signed up with a shoddy Internet provider, where you get sudden and frequent connection failures that prevent your router from, well, routing? Are you getting tired of trying to stare down your modem while waiting for the Internet connection to get up again after one of these failures? Maybe you'd prefer there to be some kind of script that you could use to be instantly notified when the connection gets back on, so you can both instantly resume obsessively browsing ICHC and get out of the living room into the kitchen and maybe do some dishes as your provider insists that crappy service is just what you need?

Well, this is the script you are looking for. You can run it after your connection went down and it'll play some sound when the connection is back again. That'd be if you run it (using some awesome sound effects from Battle for Wesnoth) as:

./wheresmahintertubes.py \
    --yay=/usr/share/games/wesnoth/1.8/data/core/music/defeat.ogg \
    --boo=/usr/share/games/wesnoth/1.8/data/core/music/victory.ogg &

Hell, you can even put it into your startup applications in Gnome or something (I figure making it into an init.d start-up script would be a bit overkill). Simplest way I can think of to do that in Ubuntu would be to go to System -> Preferences -> Startup Applications, there press Add and put the same command as above into the field aptly named Command.

The script has zero innovative mechanics: it tries to connect to Google by IP every 30 seconds and then plays an appropriate bit of incidental music through PyGame's mixer.

I figured pinging Google like that is the simplest way to determine whether the connection is up or down. Sure, it's not perfect, but it'll work well enough with the benefit of working within 5 minutes of me starting to write the script. Anyway, the timeout, the address, and the delay between checks can all be configured using the friendly commandline interface (i.e. command switches: --timeout, --uri, --delay).
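The check-and-compare logic is small enough to sketch separately. The snippet below is a Python 3 illustration and not the script that follows (which is Python 2 and uses urllib2); is_connected and transitions are names made up for the example. Note that, like the original, it treats any failure to open the address as "down".

```python
import urllib.request


def is_connected(uri, timeout=10):
    """Return True if the URI can be opened within the timeout."""
    try:
        urllib.request.urlopen(uri, timeout=timeout).close()
    except OSError:
        # URLError and socket timeouts are both subclasses of OSError.
        return False
    return True


def transitions(states):
    """Yield the new state each time consecutive readings differ."""
    states = iter(states)
    try:
        previous = next(states)
    except StopIteration:
        return
    for current in states:
        if current != previous:
            previous = current
            yield current
```

In a real loop the readings would come from calling is_connected once per delay period, and every value yielded by transitions would trigger the matching sound.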

Using PyGame is also a bit suspect in this application (overkill, mostly) and I did initially plan to use GStreamer, but PyGame means 4 times less code written on my end, and it neither requires suspiciously GTK-related libraries, nor does it mysteriously hang up or disobey orders... Indeed, it just works, and chances are you already have it installed on a lot of systems anyway, because you wanted to play Slingshot at some point...

So, here's the code. Enjoy.

 
  1 #!/usr/bin/python 
  2 # -*- coding: utf-8 -*- 
  3 # 
  4 # Where's Mah Intertubes?! 
  5 # 
  6 # A simple script that lets you know when your connection goes down or comes  
  7 # back up with sounds.  
  8 # 
  9 # Depends: 
 10 #   espeak                (if you don't want to use sound files) 
 11 #   pygame                (if you do) 
 12 # 
 13 # Options: 
 14 #   -h, --help            show this help message and exit 
 15 #   --speak               recite a message when connection goes on or off 
 16 #                         (default) 
 17 #   --off=AUDIO_OFF, --boo=AUDIO_OFF 
 18 #                         set a sound played when connection is lost 
 19 #   --on=AUDIO_ON, --yay=AUDIO_ON 
 20 #                         set a sound played when connection is back on 
 21 #   -t TIMEOUT, --timeout=TIMEOUT 
 22 #                         set timeout for checking if connection works (default: 
 23 #                         10s) 
 24 #   -d DELAY, --delay=DELAY 
 25 #                         set delay between connection checks (default: 30s) 
 26 #   -u URI, --uri=URI, --url=URI 
 27 #                         ping this URI to see if connection works (default: 
 28 #                         http://74.125.77.147) 
 29 #   -v, --verbose         display information about things done by the program 
 30 # 
 31 # Examples: 
 32 #   Let's say you want to check if you can connect and you're fine with the  
 33 #   espeak dude to moan about it instead of using a cool sound you can use one 
 34 #   of the following (let it run in the background): 
 35 #    
 36 #   ./wheresmahintertubes.py & 
 37 #   ./wheresmahintertubes.py --speak & 
 38 # 
 39 #   If you want some proper fun sounds then all you need is point them out: 
 40 # 
 41 #   ./wheresmahintertubes.py --yay=file/for_on.ogg --boo=file/for_off.mp3 & 
 42 # 
 43 # License: 
 44 #   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com> 
 45 # 
 46 #   This program is free software: you can redistribute it and/or modify it  
 47 #   under the terms of the GNU General Public License version 3, as published  
 48 #   by the Free Software Foundation. 
 49 #  
 50 #   This program is distributed in the hope that it will be useful, but  
 51 #   WITHOUT ANY WARRANTY; without even the implied warranties of  
 52 #   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR  
 53 #   PURPOSE.  See the GNU General Public License for more details. 
 54 #  
 55 #   You should have received a copy of the GNU General Public License along  
 56 #   with this program.  If not, see <http://www.gnu.org/licenses/>. 
 57 # 
 58  
 59 import time
 60 import sys
 61 import pygame
 62  
 63 quiet = False 
 64 uri = "http://74.125.77.147" 
 65  
 66 def printerr(*args):
 67     if quiet:
 68         return 
 69     from sys import argv, stderr
 70     from os.path import basename
 71     stderr.write("%s:" % basename(argv[0]))
 72     for arg in args:
 73         stderr.write(" %s" % arg)
 74     stderr.write("\n")
 75  
 76 def printout(*args):
 77     if quiet:
 78         return 
 79     from sys import argv, stdout
 80     from os.path import basename
 81     stdout.write("%s:" % basename(argv[0]))
 82     for arg in args:
 83         stdout.write(" %s" % arg)
 84     stdout.write("\n")
 85  
 86 def can_has_connection(timeout):
 87     import socket, urllib2
 88     printout("Checking ping to", uri)
 89     try:
 90         urllib2.urlopen(uri, timeout=timeout) # Google by default 
 91     except (urllib2.URLError, socket.timeout) as error:
 92         return False 
 93     return True 
 94  
 95 class Speaker:
 96     def __init__(self):
 97         self.library = {}
 98  
 99     def add_to_library(self, key, path):
100         self.library[key] = path
101  
102     def play_from_library(self, key):
103         if not key in self.library:
104             return False 
105         self.play(self.library[key])
106         return True 
107  
108     def play(self, message):
109         from subprocess import call
110         call(['espeak', message]) # pass the message as a single argument 
111  
112 class PyGamePlayer:
113     def __init__(self):
114         pygame.init()
115         self.library = {}
116  
117     def add_to_library(self, key, path):
118         self.library[key] = path
119  
120     def play_from_library(self, key):
121         if not key in self.library:
122             return False 
123         self.play(self.library[key])
124         return True 
125  
126     def play(self, path):
127         pygame.mixer.Sound(path).play()
128  
129 class Checker:
130     def __init__(self, delay, timeout):
131         self.delay = delay
132         self.timeout = timeout
133  
134     def run(self):
135         connected = can_has_connection(self.timeout)
136         printout('Initially the connection is', 'on' if connected else 'off')
137         while True:
138             time.sleep(self.delay)
139             current = can_has_connection(self.timeout)
140             if connected != current:
141                 connected = current
142                 self.react(connected)
143  
144     def react(self, connected):
145         from threading import Thread
146         printout('The connection just went', 'on' if connected else 'off')
147         def run():
148             self.player.play_from_library(connected)
149         thread = Thread()
150         thread.run = run
151         thread.start()
152  
153 if __name__ == '__main__':
154     from optparse import OptionParser
155     from os.path import basename
156     from sys import argv
157  
158     usage = '\n%s [OPTIONS] ' % basename(argv[0]) + \
159         '--on=[SOUND FILE] --off=[SOUND FILE]\n' + \
160         '\tplay a sound when network connection goes up or down' + \
161         '\n%s [OPTIONS] ' % basename(argv[0]) + '--speak\n' + \
162         '\trecite a message when network connection goes up or down (boring...)'
163 
164     description = 'Wait around and periodically check if the connection ' + \
165         'went up or down, and if that happens play an appropriate sound to ' + \
166         'indicate it to the user. Network connectivity is checked by the ' + \
167         'simple method of connecting to a specific host address, and ' + \
168         'assuming that the network is down if it takes too much time for ' + \
169         'that host to respond.'
170 
171     parser = OptionParser(usage=usage, description=description)
172 
173     parser.add_option('--speak', action='store_true', dest='speak',
174         help='recite a message when connection goes on or off (default)')
175     parser.add_option('--off', '--boo', action='store', dest='audio_off',
176         help='set a sound played when connection is lost')
177     parser.add_option('--on', '--yay', action='store', dest='audio_on',
178         help='set a sound played when connection is back on')
179     parser.add_option('-t', '--timeout', action='store', dest='timeout',
180         help='set timeout for checking if connection works (default: 10s)',
181         default=10, type='int')
182     parser.add_option('-d', '--delay', action='store', dest='delay',
183         help='set delay between connection checks (default: 30s)',
184         default=30, type='int')
185     parser.add_option('-u', '--uri', '--url', action='store', dest='uri',
186         help='ping this URI to see if connection works (default: %s)' % uri,
187         default=uri)
188     parser.add_option('-v', '--verbose', action='store_true', dest='verbose',
189         help='display information about things done by the program')
190  
191     opts, args = parser.parse_args()
192  
193     quiet = not opts.verbose
194     uri = opts.uri
195  
196     player = None 
197  
198     if opts.speak or not opts.audio_on or not opts.audio_off:
199         player = Speaker()
200         player.add_to_library(True, "Connection just went up")
201         player.add_to_library(False, "Connection just went down")
202     else:
203         player = PyGamePlayer()
204         player.add_to_library(True, opts.audio_on)
205         player.add_to_library(False, opts.audio_off)
206  
207     checker = Checker(delay=opts.delay, timeout=opts.timeout)
208     checker.player = player
209  
210     try:
211         checker.run()
212     except (KeyboardInterrupt, SystemExit):
213         printout('Exiting...')
214         running = False

The code is also available on GitHub at python/wheresmahintertubes.py.

Wednesday, July 7, 2010

OCD

OCD - the obsessive-compulsive daemon for checking websites for changes...

Have you ever waited for results of an exam or something that were meant to appear on a website somewhere? This script is supposed to spare you pressing CTRL+R every minute, by downloading the page every so often and comparing it with some prior version.

I actually don't do that sort of thing much anymore, but I'd seen some people (well, students) rapidly refreshing recently and felt that I could try to save them from themselves.

Anyway, onto the technical stuff.

Whereas doing that in a bash script using wget and whatnot would probably take around 20 lines or so, I really wanted to add support for Basic authentication and for selecting just parts of the website for comparison with XPath. The first of these can be done with wget, but the latter requires a bit more work in bash, in my experience. Also, I figured if I did it in Python maybe it'd be a bit more portable, in case somebody actually wants to use the script but for some strange reason wants to use that icky bashless operating system (and cannot install Cygwin either).
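To give a rough idea of what comparing just a part of a page buys you, here is a minimal sketch in Python 3 (the script itself is Python 2 and uses libxml2). It assumes a tiny, well-formed, namespace-free snippet and relies only on the limited XPath subset in the standard library's xml.etree.ElementTree. The select_text helper and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def select_text(xhtml, path):
    """Return the text content of every element matching path."""
    root = ET.fromstring(xhtml)
    return ["".join(element.itertext()) for element in root.findall(path)]


# Hypothetical page: a clock that changes on every download and one grade cell.
PAGE = (
    "<html><body>"
    "<p id='clock'>{clock}</p>"
    "<table><tr><td class='strong'><div><span>{grade}</span></div></td></tr></table>"
    "</body></html>"
)
PATH = ".//td[@class='strong']/div/span"

before = PAGE.format(clock="12:00", grade="pending")
ticked = PAGE.format(clock="12:01", grade="pending")
graded = PAGE.format(clock="12:02", grade="5.0")

# Whole-page comparison: the clock alone makes the two downloads differ.
print(before == ticked)
# Selected elements: identical until the grade actually shows up.
print(select_text(before, PATH) == select_text(ticked, PATH))
print(select_text(before, PATH) == select_text(graded, PATH))
```

This prints False, True, False: the whole page "changes" every time because of the clock, while the selected elements only differ once the grade appears.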

The modus operandi is rather simple: first, the original version is downloaded (line 215), then a giant while-true loop (line 224) downloads a new version (line 230) every so-and-so seconds (line 226). Depending on whether it's supposed to use XPath or not, the download will use a different function (the one on line 104 or the one on line 120) and will return different types of objects (text or list, but this is mostly unimportant). The download functions either use straightforward urllib2 to open a resource directly, or build an opener with a username and password, if these are defined. Everything else is pretty much decoration... you know - parameter parsing, communication with the user.
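Stripped of the decoration, the idea can be sketched like this. This is a Python 3 approximation, not the script itself: urllib.request stands in for urllib2, and the names make_open, watch, and max_checks are hypothetical. The fetch function is passed in so the loop can be tried without touching the network.

```python
import time
import urllib.request


def make_open(uri, user=None, password=None):
    """Return a function that opens URIs, with Basic auth if credentials are given."""
    if user is None or password is None:
        return urllib.request.urlopen
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, uri, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler).open


def watch(fetch, on_change, sleeptime=60, stop_on_difference=True,
          sleep=time.sleep, max_checks=None):
    """Fetch a baseline, then keep re-fetching and comparing against it.

    max_checks is a hypothetical safety valve for demos (None means forever).
    Returns the number of changes that were detected.
    """
    baseline = fetch()
    changes = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(sleeptime)
        current = fetch()
        checks += 1
        if current != baseline:
            changes += 1
            on_change(baseline, current)
            if stop_on_difference:
                break
            # Otherwise the changed version becomes the new original.
            baseline = current
    return changes


# Simulated page versions instead of real downloads, and no actual sleeping.
versions = iter(["results pending", "results pending", "results: 5.0"])
seen = []
watch(lambda: next(versions), lambda old, new: seen.append((old, new)),
      sleep=lambda seconds: None)
print(seen)
```

With a real page, fetch could be something like `lambda: make_open(uri, user, password)(uri).read()`, where uri, user, and password are whatever you pass in.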

And then, there's the code:


  1 #!/usr/bin/env python 
  2 # -*- coding: utf-8 -*- 
  3 # 
  4 # Obsessive-compulsive daemon. 
  5 #  
  6 # The obsessive-compulsive daemon for website checking. It checks the website 
  7 # religiously every minute so you don't have to! Never again do you have 
  8 # to toil in front of the monitor for a week after that tricky exam just pressing  
  9 # 'Refresh': employ our machine slaves to do it for you! No hassle*! 
 10 # 
 11 # The script is supposed to run in the background and download the webpage under 
 12 # scrutiny with a given frequency and compare it with the version it downloaded 
 13 # the first time. If a change is detected, the script runs a custom command. 
 14 # 
 15 # * Unless they organize a bloody revolution to overthrow their meaty 
 16 #   oppressors... and rightfully so. 
 17 # 
 18 # Depends: 
 19 #   µTidylib <apt:python-utidylib> 
 20 # 
 21 # Usage: 
 22 #   ocd [OPTIONS] URI 
 23 # 
 24 # Options: 
 25 #  -c COMMAND, --command=COMMAND 
 26 #                      Specify a command to run on change. 
 27 #  -f FORMAT, --format=FORMAT 
 28 #                      Specify how arguments are passed to the command. 
 29 #                      Available placeholders: %uri - the URI of the observed 
 30 #                      website, %old - original content, %new - changed 
 31 #                      content. 
 32 #  -x XPATH, --xpath=XPATH 
 33 #                      Provide a path to interesting elements in the observed 
 34 #                      document. 
 35 #  -s SECONDS, --sleep-time=SECONDS 
 36 #                      Set the time in seconds between checks (downloads). 
 37 #  --continue          Do not stop when change is detected. Instead run 
 38 #                      specified command and continue checking. 
 39 #  -u USER, --user=USER 
 40 #                      Specify a username for the website. 
 41 #  -P PASSWORD, --pass=PASSWORD, --password=PASSWORD 
 42 #                      Give a password for the website. 
 43 #  -p, --prompt        Prompt for login and password. 
 44 #  -v, --verbose       Print script progress information to stderr. 
 45 #  -h, --help          Show detailed usage information. 
 46 # 
 47 # Examples: 
 48 #   The simplest use-case: watch for changes in a webpage (as a whole) and check 
 49 #   it every minute. The ampersand will cause it to be run in the background in  
 50 #   bash. 
 51 # 
 52 #   ./ocd.py http://www.example.com/ & 
 53 # 
 54 #   Debugging and such: same as above, but loudmouth mode. 
 55 # 
 56 #   ./ocd.py -v http://www.example.com/ & 
 57 # 
 58 #   Watch a page for changes in a specific table element, with a login and  
 59 #   password, and checking it every 20 seconds. 
 60 #    
 61 #   ./ocd.py -ps 20 -x "//td[@class='strong']/div/span" \ 
 62 #      "https://usosweb.amu.edu.pl/kontroler.php?_action=actionx:dla_stud/studia/oceny/index()" & 
 63 #    
 64 #   Run a specific command on change: here, speak that the URI changed. 
 65 # 
 66 #   ./ocd.py -c espeak -f "Changes found in resource: %uri" \ 
 67 #      http://www.cs.put.poznan.pl/ksiek/sop/resources/embrace_change.php & 
 68 # 
 69 #   Here's a hint: if you run something in the commandline in the background 
 70 #   using the ampersand you can then turn the terminal off by pressing CTRL + D. 
 71 #  
 72 # License: 
 73 #   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com> 
 74 # 
 75 #   This program is free software: you can redistribute it and/or modify it  
 76 #   under the terms of the GNU General Public License version 3, as published  
 77 #   by the Free Software Foundation. 
 78 #  
 79 #   This program is distributed in the hope that it will be useful, but  
 80 #   WITHOUT ANY WARRANTY; without even the implied warranties of  
 81 #   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR  
 82 #   PURPOSE.  See the GNU General Public License for more details. 
 83 #  
 84 #   You should have received a copy of the GNU General Public License along  
 85 #   with this program.  If not, see <http://www.gnu.org/licenses/>. 
 86 # 
 87  
 88 # Internationalization. 
 89 import gettext
 90 from gettext import gettext as _
 91 gettext.textdomain('ocd')
 92  
 93 def _ssl_opener(uri, user, password):
 94     """ Prepare an open function to handle the specific URI with Basic  
 95     authentication with a user/password.""" 
 96     import urllib2  # needed here too: download() only imports it locally
 97     password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
 98     password_manager.add_password(None, uri, user, password)
 99     handler = urllib2.HTTPBasicAuthHandler(password_manager)
100     opener = urllib2.build_opener(handler)
101  
102     return opener.open 
103  
104 def download(uri, user=None, password=None):
105     """ Downloads the specified document and cleans it up.""" 
106  
107     import urllib2, tidy
108     from StringIO import StringIO
109  
110     opener = urllib2.urlopen if None in [user, password] \ 
111                              else _ssl_opener(uri, user, password)
112     raw_resource = ''.join(opener(uri).readlines())
113     tidy_doc = tidy.parseString(raw_resource, output_xhtml=1, add_xml_decl=1,
114                                             indent=1, output_encoding='utf8')
115     resource = StringIO()
116     tidy_doc.write(resource)
117  
118     return resource.getvalue()
119  
120 def downloadx(uri, xpath, user=None, password=None, getcontent=True):
121     """ Downloads the specified xpath elements from the document. 
122      
123     If getcontent is set to True, the element list will be converted to string 
124     before returning, otherwise a list of xmlNode objects is returned.""" 
125  
126     import libxml2
127  
128     resource = download(uri, user, password)
129     document = libxml2.htmlParseDoc(resource, None)
130     document.xpathNewContext()
131     elements = document.xpathEval(xpath)
132  
133     if getcontent:
134         return map(lambda e: e.get_content(), elements)
135     else:
136         return elements
137  
138 def getcredentials(user=None):
139     """ Prompt the user for whatever credentials are still missing. 
140      
141     If the username is given then just a prompt for password is shown to the  
142     user, and otherwise, the user is asked for both the password and username.""" 
143  
144     from sys import stderr, stdin
145     from getpass import getpass
146  
147     if user is None:
148         stderr.write(_('Username: '))
149         user = stdin.readline().strip()
150     password = getpass(_('Password: '), stream=stderr)
151  
152     return user, password
153  
154 def _print(string, verbose):
155     """ A shorthand to print debugging information out if the verbose option is 
156     set to True. This particular implementation is a bit costly.""" 
157  
158     if verbose:
159         from sys import stderr, argv
160         from os.path import basename
161         stderr.write("%s: %s\n" % (basename(argv[0]), string))
162  
163 def run(uri, effect, user=None, password=None, verbose=False, xpath=None,
164         sleeptime=60, stopondifference=True, comparator=lambda x, y: x == y,
165         prompt=False):
166     """ The main part of the script: runs comparisons in a loop. 
167      
168     Here, credentials are gathered, the original version of the observed  
169     resource is downloaded, and the script sleeps for the specified time,  
170     downloads and compares new versions of the resource and checks for changes. 
171      
172     The URI specifies the address of the resource, or webpage, or whatever, that 
173     will be observed to find changes. 
174      
175     The effect parameter is a function that will be run when a change is found. 
176     This should be a function that takes three arguments: URI, old content, and  
177     new content. The old content and new content arguments may be of type List  
178     (if xpath is used) or String (if it is not used). 
179      
180     Sleep time may be specified in seconds, controlling the time the loop will  
181     wait between downloading each new (potentially changed) version of the page. 
182      
183     An XPath query can be given that specifies the elements that should be  
184     compared with version changes instead of a whole page. (Information on  
185     XPath: http://www.w3schools.com/XPath/.) 
186      
187     If verbose is set, debugging messages are produced on stderr. 
188      
189     If stopondifference is set, the script will stop running if a change is  
190     found, otherwise, the changed version becomes the new original and checking 
191     is continued. 
192      
193     If prompt is set, the user is asked for password and login if necessary;  
194     otherwise the login and password are used as they are. 
195      
196     A custom comparator may be specified. The comparator takes two arguments of  
197     type list (if xpath is used) or string (if it isn't) and should return a 
198     boolean.""" 
199  
200     from time import sleep
201  
202     # Ask the user for password if necessary.         
203     if prompt:
204         user, password = getcredentials(user)
205  
206     # Prepare a shorthand function for downloading versions. 
207     def _retrieve():
208         if xpath:
209             return downloadx(uri, xpath, user, password)
210         else:
211             return download(uri, user, password)
212  
213     # Download the original version of the remote resource. 
214     _print(_("Downloading base version of resource %s.") % uri, verbose)
215     comparison = _retrieve()
216  
217     print comparison
218  
219     if verbose and xpath:
220         # Just print some stuff out to stderr. 
221         elements = ', '.join(map(lambda x: '"%s"' % x, comparison))
222         _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
223  
224     while True:
225         _print(_("Sleeping %s seconds.") % sleeptime, verbose)
226         sleep(sleeptime)
227  
228         # Grab a more current version of the resource. 
229         _print(_("Downloading resource %s.") % uri, verbose)
230         current = _retrieve()
231  
232         if verbose and xpath:
233             # Just print some stuff out to stderr. 
234             elements = ', '.join(map(lambda x: '"%s"' % x, current))
235             _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
236  
237         # Compare old and new version of the resource. 
238         if not comparator(comparison, current):
239             # Run the effect handler (the user's command). 
240             _print(_("Resource changed!"), verbose)
241             effect(uri, comparison, current)
242  
243             if stopondifference:
244                 break 
245             else:
246                 comparison = current
247                 continue 
248  
249         _print(_("No changes."), verbose)
250  
251 def create_handler_command(name, format):
252     """ Create a function that wraps a command for use with the main loop.""" 
253  
254     from os import popen
255     from sys import stdout
256  
257     def run_command(uri, old, new):
258         formatted = format 
259         formatted = formatted.replace("%uri", str(uri))
260         formatted = formatted.replace("%old", str(old))
261         formatted = formatted.replace("%new", str(new))
262         command = "%s %s" % (name, formatted)
263         content = ''.join(popen(command).readlines())
264         stdout.write(content)
265         return content
266  
267     return run_command
268  
269 if __name__ == '__main__':
270     """ Parse commandline options and start checking for changes.""" 
271  
272     from optparse import OptionParser, OptionGroup
273     from sys import argv, exit
274     from os.path import basename
275  
276     # Prepare the parser. 
277     usage = '%s [OPTIONS] URI' % basename(argv[0])
278     parser = OptionParser(usage=usage)
279  
280     # Options that control the process of checking the website. 
281     querying = OptionGroup(parser, 'Querying options')
282     querying.add_option('-c', '--command',  metavar='COMMAND', dest='command',
283         default='echo', help='Specify a command to run on change.')
284     querying.add_option('-f', '--format',  metavar='FORMAT', dest='format',
285          help='Specify how arguments are passed to the command. ' +
286         'Available placeholders: ' + '%uri - the URI of the observed website, ' 
287         '%old - original content, ' + '%new - changed content.', default='%uri')
288     querying.add_option('-x', '--xpath',  metavar='XPATH', dest='xpath',
289         help='Provide a path to interesting elements in the observed document.',
290         default=None)
291     querying.add_option('-s', '--sleep-time',  metavar='SECONDS', dest='sleep',
292         help='Set the time in seconds between checks (downloads).', default=60,
293         type='float')
294     querying.add_option('--continue', action='store_true',
295         default=False, help='Do not stop when change is detected. Instead ' +
296             'run specified command and continue checking.', dest='notstop')
297     parser.add_option_group(querying)
298  
299     # SSL and authentication options. 
300     security = OptionGroup(parser, 'Security options')
301     security.add_option('-u', '--user',  metavar='USER', dest="user", \ 
302         default=None, help='Specify a username for the website.')
303     security.add_option('-P', '--pass', '--password', metavar='PASSWORD', \ 
304         dest="password", default=None, help='Give a password for the website.')
305     security.add_option('-p', '--prompt', dest='prompt', action="store_true",
306         default=False, help='Prompt for login and password.')
307     parser.add_option_group(security)
308  
309     # Options that don't fit into other categories. 
310     other = OptionGroup(parser, "Other options")
311     other.add_option('-v', '--verbose', dest='verbose', action="store_true",
312         default=False, help='Print script progress information to stderr.')
313     parser.add_option_group(other)
314  
315     opts, args = parser.parse_args()
316  
317     # Check arguments 
318     if len(args) < 1:
319         _print(_("Nothing to check: quitting..."), opts.verbose)
320         exit(0)
321  
322     if len(args) > 1:
323         arguments = ', '.join(args[1:])
324         _print(_("Arguments %s are ignored.") % arguments, opts.verbose)
325  
326     # Let us begin to commence! 
327     try:
328         run(
329             args[0], create_handler_command(opts.command, opts.format),
330             user=opts.user, password=opts.password, prompt=opts.prompt,
331             verbose=opts.verbose, sleeptime=opts.sleep,
332             stopondifference=(not opts.notstop), xpath=opts.xpath
333         )
334     except (KeyboardInterrupt, SystemExit):
335        _print(_("Closed by the user."), opts.verbose)
336

The code is also available on GitHub at python/ocd.py.
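One thing to keep in mind with the %uri, %old and %new placeholders: the handler created on line 251 pastes the values into a shell command as they are, so page content containing quotes or semicolons can confuse the shell. Below is a hedged Python 3 sketch of a variant that quotes each value first. The build_command function is hypothetical and is not part of the script. It assumes the placeholders in the format string are not themselves wrapped in quotes.

```python
import re
import shlex

# Matches the three placeholders documented in the script's usage text.
PLACEHOLDER = re.compile(r"%(uri|old|new)")


def build_command(name, fmt, uri, old, new):
    """Fill in %uri, %old and %new in one pass, shell-quoting every value."""
    values = {"uri": uri, "old": old, "new": new}

    def substitute(match):
        return shlex.quote(str(values[match.group(1)]))

    return "%s %s" % (name, PLACEHOLDER.sub(substitute, fmt))


# The new content contains a semicolon that would otherwise end the command.
print(build_command("echo", "%uri changed to %new",
                    "http://www.example.com/", "pending", "5.0; rm -rf /tmp/x"))
```

Doing the substitution in a single regular-expression pass also means that a value which happens to contain the text "%old" or "%new" is not substituted a second time.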

Content

These are some scripts that I cough up every once in a while in a batch of creative frenzy.

They usually are begotten by way of me thinking 'Hey, if I make a script for it, I won't have to do it by hand!'. There's no point to them, though, because I could've done that thing 17 times by hand, by the time I finish the script...

They usually serve a specific purpose, like removing every second plosive consonant from a text file... well, maybe not that bad. But, I figure they may come in handy to someone, as the world is full of strange people.

Also, I hope that if I put them here they won't just end their lives in my /forget/where/i/put/it directory.