Gritty Scripts of Death: OCD

OCD - the obsessive-compulsive daemon for checking websites for changes...

Have you ever waited for exam results, or something like them, that were meant to appear on a website somewhere? This script is supposed to spare you from pressing CTRL+R every minute, by downloading the page every so often and comparing it with an earlier version.

I actually don't do that sort of thing much anymore, but I'd seen some people (well, students) rapidly refreshing recently and felt that I could try to save them from themselves.

Anyway, onto the technical stuff.

Doing that in a bash script using wget and whatnot would probably take around 20 lines or so, but I really wanted to add support for Basic authentication and for selecting just parts of the website for comparison with XPath. The first of these can be done with wget, but the latter requires a bit more work in bash, in my experience. Also, I figured that if I did it in Python it might be a bit more portable, in case somebody actually wants to use the script but for some strange reason is stuck on that icky bashless operating system (and cannot install Cygwin either).
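To make the XPath idea concrete, here is a minimal sketch that is not taken from the script. It uses Python 3 and the standard library's ElementTree, which understands only a subset of XPath, instead of libxml2. The PAGE sample and the select_text helper are made up for illustration, and the sketch assumes the page is well-formed and carries no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: one interesting cell plus a timestamp that
# changes on every request and would defeat a whole-page comparison.
PAGE = """<html><body><table>
<tr><td class="strong"><div><span>4.5</span></div></td></tr>
<tr><td class="weak"><div><span>ignored</span></div></td></tr>
</table><p>Generated at 12:00:01</p></body></html>"""


def select_text(xhtml, path):
    """Return the text content of every element matching the given path."""
    root = ET.fromstring(xhtml)
    return ["".join(element.itertext()) for element in root.findall(path)]


if __name__ == "__main__":
    # ElementTree wants a relative path, hence the leading dot.
    print(select_text(PAGE, ".//td[@class='strong']/div/span"))
```

Comparing only the selected elements means that noise elsewhere on the page, such as the timestamp in the sample, does not count as a change.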

The modus operandi is rather simple: first, the original version is downloaded (line 215), then a giant while-true loop (line 224) downloads a new version (line 230) every so-and-so seconds (line 226) and compares it with the original. Depending on whether it's supposed to use XPath or not, the download will use a different function (the one on line 104 or the one on line 120) and will return a different type of object (text or list, but this is mostly unimportant). The download functions either use straightforward urllib2 to open a resource directly, or build an opener with a username and password, if these are defined. Everything else is pretty much decoration... you know: parameter parsing, communication with the user.
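Stripped of the decoration, the idea can be sketched roughly as follows. This is an illustration in Python 3 (where urllib2 has become urllib.request), not the script itself: the names make_open and watch, and the max_checks and sleep parameters, are assumptions added so that the loop can be tried without a network connection or any actual waiting.

```python
import time
import urllib.request


def make_open(uri, user=None, password=None):
    """Return a function that opens a URI, using Basic auth if credentials are given."""
    if user is None or password is None:
        return urllib.request.urlopen
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, uri, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler).open


def watch(fetch, on_change, sleeptime=60, stop_on_difference=True,
          sleep=time.sleep, max_checks=None):
    """Poll fetch() and call on_change(old, new) whenever the result differs.

    Returns the number of checks performed after the baseline download.
    """
    baseline = fetch()              # the original version
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(sleeptime)            # wait between downloads
        current = fetch()           # a more current version
        checks += 1
        if current != baseline:
            on_change(baseline, current)
            if stop_on_difference:
                break
            baseline = current      # the changed version becomes the new original
    return checks


if __name__ == "__main__":
    # Fake downloads instead of a real page, and no actual sleeping.
    versions = iter(["old page", "old page", "new page"])
    watch(lambda: next(versions),
          lambda old, new: print("changed: %r -> %r" % (old, new)),
          sleeptime=0, sleep=lambda seconds: None)
```

Passing the download in as a function is a choice made for this sketch; a real fetch could be something like `lambda: make_open(uri, user, password)(uri).read()`.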

And then, there's the code:


  1 #!/usr/bin/env python 
  2 # -*- coding: utf-8 -*- 
  3 # 
  4 # Obsessive-compulsive daemon. 
  5 #  
  6 # The obsessive-compulsive daemon for website checking. It checks the website 
  7 # religiously every minute so you don't have to! Never again do you have
  8 # to toil in front of the monitor for a week after that tricky exam just pressing
  9 # 'Refresh': employ our machine slaves to do it for you! No hassle*! 
 10 # 
 11 # The script is supposed to run in the background and download the webpage under 
 12 # scrutiny with a given frequency and compare it with the version it downloaded 
 13 # the first time. If a change is detected, the script runs a custom command. 
 14 # 
 15 # * Unless they organize a bloody revolution to overthrow their meaty 
 16 #   oppressors... and rightfully so. 
 17 # 
 18 # Depends: 
 19 #   µTidylib <apt:python-utidylib> 
 20 # 
 21 # Usage: 
 22 #   ocd [OPTIONS] URI 
 23 # 
 24 # Options: 
 25 #  -c COMMAND, --command=COMMAND 
 26 #                      Specify a command to run on change. 
 27 #  -f FORMAT, --format=FORMAT 
 28 #                      Specify how arguments are passed to the command. 
 29 #                      Available placeholders: %uri - the URI of the observed 
 30 #                      website, %old - original content, %new - changed 
 31 #                      content. 
 32 #  -x XPATH, --xpath=XPATH 
 33 #                      Provide a path to interesting elements in the observed 
 34 #                      document. 
 35 #  -s SECONDS, --sleep-time=SECONDS 
 36 #                      Set the time in seconds between checks (downloads). 
 37 #  --continue          Do not stop when change is detected. Instead run 
 38 #                      specified command and continue checking. 
 39 #  -u USER, --user=USER 
 40 #                      Specify a username for the website. 
 41 #  -P PASSWORD, --pass=PASSWORD, --password=PASSWORD 
 42 #                      Give a password for the website. 
 43 #  -p, --prompt        Prompt for login and password. 
 44 #  -v, --verbose       Print script progress information to stderr. 
 45 #  -h, --help          Show detailed usage information. 
 46 # 
 47 # Examples: 
 48 #   The simplest use-case: watch for changes in a webpage (as a whole) and check 
 49 #   it every minute. The ampersand will cause it to be run in the background in  
 50 #   bash. 
 51 # 
 52 #   ./ocd.py http://www.example.com/ & 
 53 # 
 54 #   Debugging and such: same as above, but loudmouth mode. 
 55 # 
 56 #   ./ocd.py -v http://www.example.com/ & 
 57 # 
 58 #   Watch a page for changes in a specific table element, with a login and  
 59 #   password, and checking it every 20 seconds. 
 60 #    
 61 #   ./ocd.py -ps 20 -x "//td[@class='strong']/div/span" \ 
 62 #      "https://usosweb.amu.edu.pl/kontroler.php?_action=actionx:dla_stud/studia/oceny/index()" & 
 63 #    
 64 #   Run a specific command on change: here, speak that the URI changed. 
 65 # 
 66 #   ./ocd.py -c espeak -f "Changes found in resource: %uri" \ 
 67 #      http://www.cs.put.poznan.pl/ksiek/sop/resources/embrace_change.php & 
 68 # 
 69 #   Here's a hint: if you run something in the commandline in the background 
 70 #   using the ampersand you can then turn the terminal off by pressing CTRL + D. 
 71 #  
 72 # License: 
 73 #   Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com> 
 74 # 
 75 #   This program is free software: you can redistribute it and/or modify it  
 76 #   under the terms of the GNU General Public License version 3, as published  
 77 #   by the Free Software Foundation. 
 78 #  
 79 #   This program is distributed in the hope that it will be useful, but  
 80 #   WITHOUT ANY WARRANTY; without even the implied warranties of  
 81 #   MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR  
 82 #   PURPOSE.  See the GNU General Public License for more details. 
 83 #  
 84 #   You should have received a copy of the GNU General Public License along  
 85 #   with this program.  If not, see <http://www.gnu.org/licenses/>. 
 86 # 
 87  
 88 # Internationalization.
 89 import gettext
 90 from gettext import gettext as _
 91 gettext.textdomain('ocd')
 92  
 93 def _ssl_opener(uri, user, password):
 94     """ Prepare an open function to handle the specific URI with Basic
 95     authentication with a user and password."""
 96     import urllib2
 97     password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
 98     password_manager.add_password(None, uri, user, password)
 99     handler = urllib2.HTTPBasicAuthHandler(password_manager)
100     opener = urllib2.build_opener(handler)
101  
102     return opener.open 
103  
104 def download(uri, user=None, password=None):
105     """ Downloads the specified document and cleans it up."""
106  
107     import urllib2, tidy
108     from StringIO import StringIO
109  
110     opener = (urllib2.urlopen if None in [user, password]
111               else _ssl_opener(uri, user, password))
112     raw_resource = ''.join(opener(uri).readlines())
113     tidy_doc = tidy.parseString(raw_resource, output_xhtml=1, add_xml_decl=1,
114                                             indent=1, output_encoding='utf8')
115     resource = StringIO()
116     tidy_doc.write(resource)
117  
118     return resource.getvalue()
119  
120 def downloadx(uri, xpath, user=None, password=None, getcontent=True):
121     """ Downloads the specified xpath elements from the document. 
122      
123     If getcontent is set to True, the element list will be converted to string 
124     before returning, otherwise a list of xmlNode objects is returned.""" 
125  
126     import libxml2
127  
128     resource = download(uri, user, password)
129     document = libxml2.htmlParseDoc(resource, None)
130     document.xpathNewContext()
131     elements = document.xpathEval(xpath)
132  
133     if getcontent:
134         return map(lambda e: e.get_content(), elements)
135     else:
136         return elements
137  
138 def getcredentials(user=None):
139     """ Prompt the user for whatever credentials are still missing. 
140      
141     If the username is given then just a prompt for password is shown to the  
142     user, and otherwise, the user is asked for both the password and username."""
143  
144     from sys import stderr, stdin
145     from getpass import getpass
146  
147     if user is None:
148         stderr.write(_('Username: '))
149         user = stdin.readline().strip()
150     password = getpass(_('Password: '), stream=stderr)
151  
152     return user, password
153  
154 def _print(string, verbose):
155     """ A shorthand to print debugging information out if the verbose option is 
156     set to True. This particular implementation is a bit costly."""
157  
158     if verbose:
159         from sys import stderr, argv
160         from os.path import basename
161         stderr.write("%s: %s\n" % (basename(argv[0]), string))
162  
163 def run(uri, effect, user=None, password=None, verbose=False, xpath=None,
164         sleeptime=60, stopondifference=True, comparator=lambda x, y: x == y,
165         prompt=False):
166     """ The main part of the script: runs comparisons in a loop. 
167      
168     Here, credentials are gathered, the original version of the observed  
169     resource is downloaded, and the script sleeps for the specified time,  
170     downloads and compares new versions of the resource and checks for changes. 
171      
172     The URI specifies the address of the resource, or webpage, or whatever, that 
173     will be observed to find changes. 
174      
175     The effect parameter is a function that will be run when a change is found. 
176     This should be a function that takes three arguments: URI, old content, and  
177     new content. The old content and new content arguments may be of type List  
178     (if xpath is used) or String (if it is not used). 
179      
180     Sleep time may be specified in seconds, controlling the time the loop will  
181     wait between downloading each new (potentially changed) version of the page. 
182      
183     An XPath query can be given that specifies the elements that should be  
184     compared with version changes instead of a whole page. (Information on  
185     XPath: http://www.w3schools.com/XPath/.) 
186      
187     If verbose is set, debugging messages are produced on stderr. 
188      
189     If stopondifference is set, the script will stop running if a change is  
190     found, otherwise, the changed version becomes the new original and checking 
191     is continued. 
192      
193     If prompt is set, the user is asked for password and login if necessary;  
194     otherwise the login and password are used as they are. 
195      
196     A custom comparator may be specified. The comparator takes two arguments of  
197     type list (if xpath is used) or string (if it isn't) and should return a 
198     boolean.""" 
199  
200     from time import sleep
201  
202     # Ask the user for password if necessary.         
203     if prompt:
204         user, password = getcredentials(user)
205  
206     # Prepare a shorthand function for downloading versions. 
207     def _retrieve():
208         if xpath:
209             return downloadx(uri, xpath, user, password)
210         else:
211             return download(uri, user, password)
212  
213     # Download the original version of the remote resource. 
214     _print(_("Downloading base version of resource %s.") % uri, verbose)
215     comparison = _retrieve()
216  
217     print comparison
218  
219     if verbose and xpath:
220         # Just print some stuff out to stderr. 
221         elements = ', '.join(map(lambda x: '"%s"' % x, comparison))
222         _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
223  
224     while True:
225         _print(_("Sleeping %s seconds.") % sleeptime, verbose)
226         sleep(sleeptime)
227  
228         # Grab a more current version of the resource. 
229         _print(_("Downloading resource %s.") % uri, verbose)
230         current = _retrieve()
231  
232         if verbose and xpath:
233             # Just print some stuff out to stderr. 
234             elements = ', '.join(map(lambda x: '"%s"' % x, current))
235             _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
236  
237         # Compare old and new version of the resource. 
238         if not comparator(comparison, current):
239             # Run the handler on the old and new versions.
240             _print(_("Resource changed!"), verbose)
241             effect(uri, comparison, current)
242  
243             if stopondifference:
244                 break 
245             else:
246                 comparison = current
247                 continue 
248  
249         _print(_("No changes."), verbose)
250  
251 def create_handler_command(name, format):
252     """ Create a function that wraps a command for use with the main loop.""" 
253  
254     from os import popen
255     from sys import stdout
256  
257     def run_command(uri, old, new):
258         formatted = format 
259         formatted = formatted.replace("%uri", str(uri))
260         formatted = formatted.replace("%old", str(old))
261         formatted = formatted.replace("%new", str(new))
262         command = "%s %s" % (name, formatted)
263         content = ''.join(popen(command).readlines())
264         stdout.write(content)
265         return content
266  
267     return run_command
268  
269 if __name__ == '__main__':
270     """ Parse commandline options and start checking for changes.""" 
271  
272     from optparse import OptionParser, OptionGroup
273     from sys import argv, exit
274     from os.path import basename
275  
276     # Prepare the parser. 
277     usage = '%s [OPTIONS] URI' % basename(argv[0])
278     parser = OptionParser(usage=usage)
279  
280     # Options that control the process of checking the website. 
281     querying = OptionGroup(parser, 'Querying options')
282     querying.add_option('-c', '--command',  metavar='COMMAND', dest='command',
283         default='echo', help='Specify a command to run on change.')
284     querying.add_option('-f', '--format',  metavar='FORMAT', dest='format',
285          help='Specify how arguments are passed to the command. ' +
286         'Available placeholders: ' + '%uri - the URI of the observed website, ' 
287         '%old - original content, ' + '%new - changed content.', default='%uri')
288     querying.add_option('-x', '--xpath',  metavar='XPATH', dest='xpath',
289         help='Provide a path to interesting elements in the observed document.',
290         default=None)
291     querying.add_option('-s', '--sleep-time',  metavar='SECONDS', dest='sleep',
292         help='Set the time in seconds between checks (downloads).', default=60,
293         type='float')
294     querying.add_option('--continue', action='store_true',
295         default=False, help='Do not stop when change is detected. Instead ' +
296             'run specified command and continue checking.', dest='notstop')
297     parser.add_option_group(querying)
298  
299     # SSL and authentication options. 
300     security = OptionGroup(parser, 'Security options')
301     security.add_option('-u', '--user',  metavar='USER', dest="user",
302         default=None, help='Specify a username for the website.')
303     security.add_option('-P', '--pass', '--password', metavar='PASSWORD',
304         dest="password", default=None, help='Give a password for the website.')
305     security.add_option('-p', '--prompt', dest='prompt', action="store_true",
306         default=False, help='Prompt for login and password.')
307     parser.add_option_group(security)
308  
309     # Options that don't fit into other categories. 
310     other = OptionGroup(parser, "Other options")
311     other.add_option('-v', '--verbose', dest='verbose', action="store_true",
312         default=False, help='Print script progress information to stderr.')
313     parser.add_option_group(other)
314  
315     opts, args = parser.parse_args()
316  
317     # Check arguments 
318     if len(args) < 1:
319         _print(_("Nothing to check: quitting..."), opts.verbose)
320         exit(0)
321  
322     if len(args) > 1:
323         arguments = ', '.join(args[1:])
324         _print(_("Arguments %s are ignored.") % arguments, opts.verbose)
325  
326     # Let us begin to commence! 
327     try:
328         run(
329             args[0], create_handler_command(opts.command, opts.format),
330             user=opts.user, password=opts.password, prompt=opts.prompt,
331             verbose=opts.verbose, sleeptime=opts.sleep,
332             stopondifference=(not opts.notstop), xpath=opts.xpath
333         )
334     except (KeyboardInterrupt, SystemExit):
335         _print(_("Closed by the user."), opts.verbose)
336

The code is also available on GitHub at python/ocd.py.
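One thing worth noting about the command handler: it pastes %old and %new straight into a shell command line, so page content containing quotes or semicolons ends up being interpreted by the shell. A possible alternative, sketched here in Python 3 and not part of ocd.py (build_command is a hypothetical helper), splits the format into arguments first and substitutes afterwards, so each placeholder value stays inside a single argument.

```python
import re
import shlex

# The three placeholders understood by the -f option.
PLACEHOLDER = re.compile(r"%(uri|old|new)")


def build_command(name, fmt, uri, old, new):
    """Build an argument list for the command with placeholders filled in."""
    values = {"uri": str(uri), "old": str(old), "new": str(new)}
    # Split the format into shell-like words before substituting, so that
    # whatever the page contains cannot introduce extra arguments.
    return [name] + [
        PLACEHOLDER.sub(lambda match: values[match.group(1)], token)
        for token in shlex.split(fmt)
    ]


if __name__ == "__main__":
    print(build_command("echo", "changed: %uri",
                        "http://www.example.com/", "old text", "new text"))
```

The resulting list could then be handed to subprocess.run without shell=True, which avoids the shell entirely.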

Wednesday, July 7, 2010
