Wednesday, July 7, 2010

OCD

OCD - the obsessive-compulsive daemon for checking websites for changes...

Have you ever waited for results of an exam or something that were meant to appear on a website somewhere? This script is supposed to spare you pressing CTRL+R every minute, by downloading the page every so often and comparing it with some prior version.

I actually don't do that sort of thing much anymore, but I'd seen some people (well, students) rapidly refreshing recently and felt that I could try to save them from themselves.

Anyway, on to the technical stuff.

Doing that in a bash script with wget and whatnot would probably take around 20 lines, but I really wanted to add support for Basic authentication and for selecting just parts of the page for comparison with XPath. The first of these can be done with wget, but the second takes a bit more work in bash, in my experience. Also, I figured that doing it in Python would make it a bit more portable, in case somebody actually wants to use the script but for some strange reason uses that icky bashless operating system (and cannot install Cygwin either).
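To make those two features a bit more concrete, here is a minimal sketch of both in Python 3, using only the standard library. It is not the script's code: the script is Python 2 and relies on urllib2, µTidylib and libxml2. The names `basic_auth_opener` and `select_text`, the sample page and the credentials are all made up for illustration, and ElementTree only handles a subset of XPath on well-formed XML.

```python
import urllib.request
import xml.etree.ElementTree as ET


def basic_auth_opener(uri, user, password):
    """Build an opener that answers Basic auth challenges for `uri`."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URI".
    manager.add_password(None, uri, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


def select_text(xhtml, path):
    """Return the text content of every element matching `path`.

    ElementTree only understands a subset of XPath and needs well-formed
    XML, so this assumes the page was already cleaned up into XHTML.
    """
    root = ET.fromstring(xhtml)
    return ["".join(element.itertext()) for element in root.findall(path)]


# Hypothetical page fragment; no network access is needed for this part.
PAGE = """<html><body><table><tr>
<td class="strong"><div><span>4.5</span></div></td>
<td class="weak"><div><span>ignored</span></div></td>
</tr></table></body></html>"""

grades = select_text(PAGE, ".//td[@class='strong']/div/span")
print(grades)  # prints ['4.5']

# Building the opener does not contact the server; opener.open(uri) would.
opener = basic_auth_opener("https://www.example.com/grades", "alice", "secret")
```

Comparing only the selected elements means that a changing banner or timestamp elsewhere on the page does not count as a change.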

The modus operandi is rather simple: first, the original version is downloaded (line 215), then a giant while-true loop (line 224) downloads a new version (line 230) every so-and-so seconds (line 226). Depending on whether it's supposed to use XPath or not, the download uses a different function (the one on line 104 or the one on line 120) and returns a different type of object (text or a list, but this is mostly unimportant). The download functions either use straightforward urllib2 to open a resource directly, or build an opener with a username and password, if these are defined. Everything else is pretty much decoration... you know: parameter parsing, communication with the user.
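Stripped of the decoration, the loop boils down to something like the sketch below. It is a Python 3 simplification, not the script itself: `watch`, `fetch` and `on_change` are hypothetical names, and the download step is passed in as a function so that the example runs without touching the network.

```python
import time


def watch(fetch, on_change, sleep_time=60, stop_on_difference=True,
          same=lambda old, new: old == new, sleep=time.sleep):
    """Poll `fetch` until its result differs from the baseline.

    `fetch` returns the current version (text or list), `on_change` is
    called with the old and new versions, and `same` is the comparator.
    """
    baseline = fetch()  # the original version
    while True:
        sleep(sleep_time)
        current = fetch()  # a more current version
        if same(baseline, current):
            continue  # nothing changed, go back to sleep
        on_change(baseline, current)
        if stop_on_difference:
            return current
        # Otherwise the changed version becomes the new original.
        baseline = current


# Simulated downloads: the page changes on the third fetch.
versions = iter(["pending", "pending", "grade: 5"])
changes = []
result = watch(lambda: next(versions),
               lambda old, new: changes.append((old, new)),
               sleep_time=0)
print(changes)  # prints [('pending', 'grade: 5')]
```

Passing the comparator and the sleep function as parameters is just a convenience for testing; the script fixes the sleep to `time.sleep` and only makes the comparator configurable.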

And then, there's the code:


1 #!/usr/bin/env python
2 # -*- coding: utf-8 -*-
3 #
4 # Obsessive-compulsive daemon.
5 #
6 # The obsessive-compulsive daemon for website checking. It checks the website
7 # religiously every minute so you don't have to! Never again do you have
8 # to toil in front of the monitor for weeks after that tricky exam just pressing
9 # 'Refresh': employ our machine slaves to do it for you! No hassle*!
10 #
11 # The script is supposed to run in the background and download the webpage under
12 # scrutiny with a given frequency and compare it with the version it downloaded
13 # the first time. If a change is detected, the script runs a custom command.
14 #
15 # * Unless they organize a bloody revolution to overthrow their meaty
16 # oppressors... and rightfully so.
17 #
18 # Depends:
19 # µTidylib <apt:python-utidylib>, libxml2 <apt:python-libxml2> (for --xpath)
20 #
21 # Usage:
22 # ocd [OPTIONS] URI
23 #
24 # Options:
25 # -c COMMAND, --command=COMMAND
26 # Specify a command to run on change.
27 # -f FORMAT, --format=FORMAT
28 # Specify how arguments are passed to the command.
29 # Available placeholders: %uri - the URI of the observed
30 # website, %old - original content, %new - changed
31 # content.
32 # -x XPATH, --xpath=XPATH
33 # Provide a path to interesting elements in the observed
34 # document.
35 # -s SECONDS, --sleep-time=SECONDS
36 # Set the time in seconds between checks (downloads).
37 # --continue Do not stop when change is detected. Instead run
38 # specified command and continue checking.
39 # -u USER, --user=USER
40 # Specify a username for the website.
41 # -P PASSWORD, --pass=PASSWORD, --password=PASSWORD
42 # Give a password for the website.
43 # -p, --prompt Prompt for login and password.
44 # -v, --verbose Print script progress information to stderr.
45 # -h, --help Show detailed usage information.
46 #
47 # Examples:
48 # The simplest use-case: watch for changes in a webpage (as a whole) and check
49 # it every minute. The ampersand will cause it to be run in the background in
50 # bash.
51 #
52 # ./ocd.py http://www.example.com/ &
53 #
54 # Debugging and such: same as above, but loudmouth mode.
55 #
56 # ./ocd.py -v http://www.example.com/ &
57 #
58 # Watch a page for changes in a specific table element, with a login and
59 # password, and checking it every 20 seconds.
60 #
61 # ./ocd.py -ps 20 -x "//td[@class='strong']/div/span" \
62 # "https://usosweb.amu.edu.pl/kontroler.php?_action=actionx:dla_stud/studia/oceny/index()" &
63 #
64 # Run a specific command on change: here, speak that the URI changed.
65 #
66 # ./ocd.py -c espeak -f "Changes found in resource: %uri" \
67 # http://www.cs.put.poznan.pl/ksiek/sop/resources/embrace_change.php &
68 #
69 # Here's a hint: if you run something in the commandline in the background
70 # using the ampersand you can then turn the terminal off by pressing CTRL + D.
71 #
72 # License:
73 # Copyright (C) 2010 Konrad Siek <konrad.siek@gmail.com>
74 #
75 # This program is free software: you can redistribute it and/or modify it
76 # under the terms of the GNU General Public License version 3, as published
77 # by the Free Software Foundation.
78 #
79 # This program is distributed in the hope that it will be useful, but
80 # WITHOUT ANY WARRANTY; without even the implied warranties of
81 # MERCHANTABILITY, SATISFACTORY QUALITY, or FITNESS FOR A PARTICULAR
82 # PURPOSE. See the GNU General Public License for more details.
83 #
84 # You should have received a copy of the GNU General Public License along
85 # with this program. If not, see <http://www.gnu.org/licenses/>.
86 #
87
88 # Internationalization.
89 import gettext
90 from gettext import gettext as _
91 gettext.textdomain('ocd')
92
93 def _ssl_opener(uri, user, password):
94 """ Prepare an open function to handle the specific URI with Basic
95 authentication with a user/password."""
96 import urllib2
97 password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
98 password_manager.add_password(None, uri, user, password)
99 handler = urllib2.HTTPBasicAuthHandler(password_manager)
100 opener = urllib2.build_opener(handler)
101
102 return opener.open
103
104 def download(uri, user=None, password=None):
105 """ Downloads the specified document and cleans it up."""
106
107 import urllib2, tidy
108 from StringIO import StringIO
109
110 opener = urllib2.urlopen if None in [user, password] \
111 else _ssl_opener(uri, user, password)
112 raw_resource = ''.join(opener(uri).readlines())
113 tidy_doc = tidy.parseString(raw_resource, output_xhtml=1, add_xml_decl=1,
114 indent=1, output_encoding='utf8')
115 resource = StringIO()
116 tidy_doc.write(resource)
117
118 return resource.getvalue()
119
120 def downloadx(uri, xpath, user=None, password=None, getcontent=True):
121 """ Downloads the specified xpath elements from the document.
122
123 If getcontent is set to True, the element list will be converted to string
124 before returning, otherwise a list of xmlNode objects is returned."""
125
126 import libxml2
127
128 resource = download(uri, user, password)
129 document = libxml2.htmlParseDoc(resource, None)
130 document.xpathNewContext()
131 elements = document.xpathEval(xpath)
132
133 if getcontent:
134 return map(lambda e: e.get_content(), elements)
135 else:
136 return elements
137
138 def getcredentials(user=None):
139 """ Prompt the user for whatever credentials are still missing.
140
141 If the username is given then just a prompt for password is shown to the
142 user, and otherwise the user is asked for both the password and username."""
143
144 from sys import stderr, stdin
145 from getpass import getpass
146
147 if user is None:
148 stderr.write(_('Username: '))
149 user = stdin.readline().strip()
150 password = getpass(_('Password: '), stream=stderr)
151
152 return user, password
153
154 def _print(string, verbose):
155 """ A shorthand to print debugging information out if the verbose option is
156 set to True. This particular implementation is a bit costly."""
157
158 if verbose:
159 from sys import stderr, argv
160 from os.path import basename
161 stderr.write("%s: %s\n" % (basename(argv[0]), string))
162
163 def run(uri, effect, user=None, password=None, verbose=False, xpath=None,
164 sleeptime=60, stopondifference=True, comparator=lambda x, y: x == y,
165 prompt=False):
166 """ The main part of the script: runs comparisons in a loop.
167
168 Here, credentials are gathered, the original version of the observed
169 resource is downloaded, and the script sleeps for the specified time,
170 downloads and compares new versions of the resource and checks for changes.
171
172 The URI specifies the address of the resource, or webpage, or whatever, that
173 will be observed to find changes.
174
175 The effect parameter is a function that will be run when a change is found.
176 This should be a function that takes three arguments: URI, old content, and
177 new content. The old content and new content arguments may be of type List
178 (if xpath is used) or String (if it is not used).
179
180 Sleep time may be specified in seconds, controlling the time the loop will
181 wait between downloading each new (potentially changed) version of the page.
182
183 An XPath query can be given that specifies the elements that should be
184 compared with version changes instead of a whole page. (Information on
185 XPath: http://www.w3schools.com/XPath/.)
186
187 If verbose is set, debugging messages are produced on stderr.
188
189 If stopondifference is set, the script will stop running if a change is
190 found, otherwise, the changed version becomes the new original and checking
191 is continued.
192
193 If prompt is set, the user is asked for password and login if necessary;
194 otherwise the login and password are used as they are.
195
196 A custom comparator may be specified. The comparator takes two arguments of
197 type list (if xpath is used) or string (if it isn't) and should return a
198 boolean."""
199
200 from time import sleep
201
202 # Ask the user for password if necessary.
203 if prompt:
204 user, password = getcredentials(user)
205
206 # Prepare a shorthand function for downloading versions.
207 def _retrieve():
208 if xpath:
209 return downloadx(uri, xpath, user, password)
210 else:
211 return download(uri, user, password)
212
213 # Download the original version of the remote resource.
214 _print(_("Downloading base version of resource %s.") % uri, verbose)
215 comparison = _retrieve()
216
217 _print(comparison, verbose)
218
219 if verbose and xpath:
220 # Just print some stuff out to stderr.
221 elements = ', '.join(map(lambda x: '"%s"' % x, comparison))
222 _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
223
224 while True:
225 _print(_("Sleeping %s seconds.") % sleeptime, verbose)
226 sleep(sleeptime)
227
228 # Grab a more current version of the resource.
229 _print(_("Downloading resource %s.") % uri, verbose)
230 current = _retrieve()
231
232 if verbose and xpath:
233 # Just print some stuff out to stderr.
234 elements = ', '.join(map(lambda x: '"%s"' % x, current))
235 _print(_("Elements at XPath %s: %s.") % (xpath, elements), verbose)
236
237 # Compare old and new version of the resource.
238 if not comparator(comparison, current):
239 # Run
240 _print(_("Resource changed!"), verbose)
241 effect(uri, comparison, current)
242
243 if stopondifference:
244 break
245 else:
246 comparison = current
247 continue
248
249 _print(_("No changes."), verbose)
250
251 def create_handler_command(name, format):
252 """ Create a function that wraps a command for use with the main loop."""
253
254 from os import popen
255 from sys import stdout
256
257 def run_command(uri, old, new):
258 formatted = format
259 formatted = formatted.replace("%uri", str(uri))
260 formatted = formatted.replace("%old", str(old))
261 formatted = formatted.replace("%new", str(new))
262 command = "%s %s" % (name, formatted)
263 content = ''.join(popen(command).readlines())
264 stdout.write(content)
265 return content
266
267 return run_command
268
269 if __name__ == '__main__':
270 """ Parse commandline options and start checking for changes."""
271
272 from optparse import OptionParser, OptionGroup
273 from sys import argv, exit
274 from os.path import basename
275
276 # Prepare the parser.
277 usage = '%s [OPTIONS] URI' % basename(argv[0])
278 parser = OptionParser(usage=usage)
279
280 # Options that control the process of checking the website.
281 querying = OptionGroup(parser, 'Querying options')
282 querying.add_option('-c', '--command', metavar='COMMAND', dest='command',
283 default='echo', help='Specify a command to run on change.')
284 querying.add_option('-f', '--format', metavar='FORMAT', dest='format',
285 help='Specify how arguments are passed to the command. ' +
286 'Available placeholders: ' + '%uri - the URI of the observed website, '
287 '%old - original content, ' + '%new - changed content.', default='%uri')
288 querying.add_option('-x', '--xpath', metavar='XPATH', dest='xpath',
289 help='Provide a path to interesting elements in the observed document.',
290 default=None)
291 querying.add_option('-s', '--sleep-time', metavar='SECONDS', dest='sleep',
292 help='Set the time in seconds between checks (downloads).', default=60,
293 type='float')
294 querying.add_option('--continue', action='store_true',
295 default=False, help='Do not stop when change is detected. Instead ' +
296 'run specified command and continue checking.', dest='notstop')
297 parser.add_option_group(querying)
298
299 # SSL and authentication options.
300 security = OptionGroup(parser, 'Security options')
301 security.add_option('-u', '--user', metavar='USER', dest="user", \
302 default=None, help='Specify a username for the website.')
303 security.add_option('-P', '--pass', '--password', metavar='PASSWORD', \
304 dest="password", default=None, help='Give a password for the website.')
305 security.add_option('-p', '--prompt', dest='prompt', action="store_true",
306 default=False, help='Prompt for login and password.')
307 parser.add_option_group(security)
308
309 # Options that don't fit into other categories.
310 other = OptionGroup(parser, "Other options")
311 other.add_option('-v', '--verbose', dest='verbose', action="store_true",
312 default=False, help='Print script progress information to stderr.')
313 parser.add_option_group(other)
314
315 opts, args = parser.parse_args()
316
317 # Check arguments
318 if len(args) < 1:
319 _print(_("Nothing to check: quitting..."), opts.verbose)
320 exit(0)
321
322 if len(args) > 1:
323 arguments = ', '.join(args[1:])
324 _print(_("Arguments %s are ignored.") % arguments, opts.verbose)
325
326 # Let us begin to commence!
327 try:
328 run(
329 args[0], create_handler_command(opts.command, opts.format),
330 user=opts.user, password=opts.password, prompt=opts.prompt,
331 verbose=opts.verbose, sleeptime=opts.sleep,
332 stopondifference=(not opts.notstop), xpath=opts.xpath
333 )
334 except (KeyboardInterrupt, SystemExit):
335 _print(_("Closed by the user."), opts.verbose)
336

The code is also available on GitHub at python/ocd.py.
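One thing to keep in mind about the -c and -f options: the listing glues the command and the formatted string together and hands the result to a shell through popen, so page content substituted for %old or %new gets interpreted by that shell. A possible alternative, sketched here in Python 3 with the hypothetical name `make_handler`, is to pass the formatted text as a single argument without a shell. This behaves differently from the script, where the shell splits the formatted string into several arguments.

```python
import subprocess
import sys


def make_handler(command, fmt="%uri"):
    """Build a callback that runs `command` with one formatted argument.

    `command` is a list (program plus fixed arguments); the formatted
    text is appended as a single extra argument, so no shell is involved.
    """
    def handler(uri, old, new):
        argument = (fmt.replace("%uri", str(uri))
                       .replace("%old", str(old))
                       .replace("%new", str(new)))
        completed = subprocess.run(command + [argument],
                                   capture_output=True, text=True,
                                   check=True)
        return completed.stdout

    return handler


# Stand-in for a command such as espeak: a tiny Python program that
# prints its first argument, so the example works on any system.
echo_cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
speak = make_handler(echo_cmd, "Changes found in resource: %uri")
output = speak("http://www.example.com/", "old", "new")
print(output, end="")
```

The returned function has the same three-argument shape (URI, old content, new content) that the `effect` parameter of `run` expects.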