Friday, September 18, 2009

Concordancer

Concordancers are a sort of pet project for me - I'm often in the process of making one. They're simple enough to be fun, and complex enough to be interesting.

An additional perk is that nobody knows what concordancers are. If you want to know about them, you probably want to start here and dig on.

This particular instance is meant to be very simple, and leaves any pre- and post-processing to the user. For instance, the text is split into words on whitespace only, so punctuation sticks to the words and distorts the results. If you want that fixed, you have to do it before passing the text on to the concordancer.
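If you do want to clean the text up first, a small filter in front of the concordancer would do. Here's a rough sketch of one way to go about it; the function name strip_punctuation is made up for this example and is not part of the script below.

```python
import string

def strip_punctuation(text):
    """Split text on whitespace and trim punctuation off both ends of each token.

    Hypothetical helper, not part of concordancer.py.
    """
    tokens = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted of punctuation only.
    return [token for token in tokens if token]
```

Note that this only trims the ends of each token, so punctuation inside a token (as in "XXXI.--They") is left alone.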

Ok, so here's a quick walkthrough.

Basically, the concordancer reads the text from stdin and the keyword list from the arguments. Here's an example of searching for the word Sword in the text file de-bello-gallico.txt:

cat de-bello-gallico.txt | ./concordancer.py Sword

That should output something like:

then much reduced by the *sword* of the enemy, and by
rest, perish either by the *sword* or by famine." XXXI.--They rise
rushes on briskly with his *sword* and carries on the combat
Therefore, having put to the *sword* the garrison of Noviodunum and
had advanced up the hill *sword* in hand, and had forced
labour, should put to the *sword* all the grown-up inhabitants, as
made a blow with his *sword* at his naked shoulder and
by the wound of a *sword* in the mouth; nor was

Actually, an even simpler invocation is available if you want to create concordances for all the words in the text. In that case, you needn't provide a list of keywords, and can just go:

cat de-bello-gallico.txt | ./concordancer.py

... but I'm not sure how useful that'll be to you.

Typically, you'd probably want to find concordances for a word in all its forms. You can do that using aspell to generate the list of keywords from a given root:

cat de-bello-gallico.txt | ./concordancer.py `aspell dump master | grep sword`

And that'll produce output for all words containing the substring sword.

Now, there are probably better ways to use aspell for that purpose, but honestly, I played around with it and this is the only one that got me the result I wanted...
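One alternative that avoids aspell altogether would be to pick the keywords out of the text itself. This is only a sketch of the idea, not something the script does: keywords_containing is a hypothetical helper that collects every distinct word in the input that contains a given root.

```python
def keywords_containing(root, words):
    """Return the distinct words that contain root, ignoring case.

    Hypothetical helper: it looks only at the words of the input text,
    not at a dictionary like aspell's.
    """
    root = root.lower()
    found = []
    seen = set()
    for word in words:
        key = word.lower()
        if root in key and key not in seen:
            seen.add(key)
            found.append(key)
    return found
```

The resulting list could then be passed to the concordancer as its keyword arguments.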

You can play around with different formats too, by just converting them to text prior to creating the concordance. For instance, for PDFs, use pdftotext:

pdftotext de-bello-gallico.pdf - | ./concordancer.py `aspell dump master | grep sword`

Right. You can also play around with the output of the concordancer. By default it marks the keywords in concordances with asterisks, but you can change that, to e.g. some HTML tags, by going:

cat de-bello-gallico.txt | ./concordancer.py -p '<b>' -s '</b>' Sword

And that'll produce something like:

then much reduced by the <b>sword</b> of the enemy, and by
rest, perish either by the <b>sword</b> or by famine." XXXI.--They rise
...
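The marking itself boils down to wrapping every word that matches the keyword, case-insensitively, in the prefix and suffix. A minimal sketch, assuming a made-up helper called mark_keyword (the script does the same thing inline while printing):

```python
def mark_keyword(words, keyword, prefix="*", suffix="*"):
    """Join words into a line, wrapping case-insensitive matches of keyword."""
    marked = []
    for word in words:
        if word.lower() == keyword.lower():
            marked.append(prefix + word + suffix)
        else:
            marked.append(word)
    return " ".join(marked)
```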

Another thing you can do is change the size of the context, here to up to 3 words on each side:

cat de-bello-gallico.txt | ./concordancer.py -c 3 Sword 

That will output:

reduced by the *sword* of the enemy,
either by the *sword* or by famine."
briskly with his *sword* and carries on
put to the *sword* the garrison of
up the hill *sword* in hand, and
put to the *sword* all the grown-up
blow with his *sword* at his naked
wound of a *sword* in the mouth;
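The "up to" matters: a keyword near the beginning or end of the text simply gets a shorter context. The sketch below mirrors the slicing that form_concordance does in the listing further down; the name context_window is just for illustration.

```python
def context_window(words, index, context_size):
    """Return words[index] with up to context_size words on each side."""
    # Clamp the start so that a negative index does not wrap around the list.
    start = max(0, index - context_size)
    # Slicing past the end of a list is safe in Python, so no clamp is needed there.
    return words[start:index + context_size + 1]
```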

Also, you can group the output by keywords:

cat de-bello-gallico.txt | ./concordancer.py -d group reserves declares

And that gives you something like this:

*reserves*:
neither could proper *reserves* be posted, nor
*declares*:
suddenly assaulted; he *declares* himself ready to
that council he *declares* Cingetorix, the leader
Hispania Baetica, _Carmone_; *declares* for Caesar, and
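Grouping works because the concordances are kept in a map from each keyword to the list of contexts it was found in. Roughly like this, where group_by_keyword is a made-up name and the whole thing is a simplified sketch rather than the script's exact code:

```python
def group_by_keyword(keywords, words, context_size):
    """Map each keyword to the list of contexts in which it occurs."""
    groups = {keyword: [] for keyword in keywords}
    for index, word in enumerate(words):
        for keyword in keywords:
            if word.lower() == keyword.lower():
                start = max(0, index - context_size)
                context = words[start:index + context_size + 1]
                groups[keyword].append(" ".join(context))
    return groups
```

A keyword that never occurs in the text still gets an entry, just with an empty list.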

Enough rambling, here's the code:
 
#!/usr/bin/python
#
# Concordancer
#
# A script for finding concordances for given keywords in the
# specified text.
#
# A concordance is a keyword with its context (here, the closest
# n words), a combination used, for instance, in lexicography to
# deduce the meaning of the keyword based on the way it is used
# in text.
#
# Parameters:
#   c - the number of words that surround a keyword in context
#   p - the string that is stuck in front of keywords
#   s - the string that is stuck at the ends of keywords
#   d - formatting of the display,
#       'simple' - one concordance per line (default)
#       'group' - group concordances by keywords
#
# Example:
#   to find concordances for the words 'arguments' and 'options'
#   in the bash manual:
#       man bash | concordancer.py arguments options
#
# Author:
#   Konrad Siek <konrad.siek@gmail.com>
#
# License:
#
# Copyright 2009 Konrad Siek
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Imports.
import getopt
import sys

# Option sigils - the characters associated with various options.
CONTEXT_SIZE = 'c'
PREFIX = 'p'
SUFFIX = 's'
DISPLAY = 'd'

# Option default values, represented as a map for convenience.
OPTIONS = {
    CONTEXT_SIZE: str(5),
    PREFIX: '*',
    SUFFIX: '*',
    DISPLAY: 'simple'
}

# Character constants, also for convenience.
EMPTY = ""
SPACE = " "
NEWLINE = "\n"
TAB = "\t"
COLON = ":"
SWITCH = "-"

def display_help(program_name):
    """ Display usage information.

    @param program_name - the name of the script"""

    help_string = """Usage:
    %s [OPTION] ... [WORD] ...
Options:
    -%s  the number of words that surround a keyword in context
    -%s  the string that is stuck in front of keywords
    -%s  the string that is stuck at the ends of keywords
    -%s  formatting of the display,
          'simple' - one concordance per line (default)
          'group' - group concordances by keywords
Words:
    The list of words that concordances will be searched for. If
    no list is provided, a complete concordance is made - that is,
    one using all input words.""" \
    % (program_name, CONTEXT_SIZE, PREFIX, SUFFIX, DISPLAY)
    print(help_string)

def find_concordances(keywords, words, context_size):
    """ Finds concordances for keywords in a list of input words.

    @param keywords - list of keywords,
    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of keywords to lists of concordances"""

    # Initialize the concordance map with empty lists, for each keyword.
    concordances = prep_concordance_map(keywords)

    # If any word in the text matches a keyword, create a concordance.
    for i in range(0, len(words)):
        for keyword in keywords:
            if matches(keyword, words[i]):
                concordance = form_concordance(words, i, context_size)
                concordances[keyword].append(concordance)

    return concordances

def find_all_concordances(words, context_size):
    """ Make a complete concordance - assume all words match.

    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of words to lists of concordances"""

    concordances = {}

    for i in range(0, len(words)):
        word = words[i]
        if word not in concordances:
            concordances[word] = []
        concordance = form_concordance(words, i, context_size)
        concordances[word].append(concordance)

    return concordances

def print_concordances(concordances, simple, prefix, suffix):
    """ Print the concordances to screen.

    @param concordances - map of keywords to lists of concordances
    @param simple - True: display only concordances, False: group by keywords
    @param prefix - prefix to keywords
    @param suffix - suffix to keywords"""

    # For each concordance, mark the keywords in the sentence and print it out.
    for keyword in concordances:
        if not simple:
            sys.stdout.write(prefix + keyword + suffix + COLON + NEWLINE)
        for words in concordances[keyword]:
            if not simple:
                sys.stdout.write(TAB)
            for i in range(0, len(words)):
                if matches(keyword, words[i]):
                    sys.stdout.write(prefix + words[i] + suffix)
                else:
                    sys.stdout.write(words[i])
                if i < len(words) - 1:
                    sys.stdout.write(SPACE)
                else:
                    sys.stdout.write(NEWLINE)

def prep_concordance_map(dict_words):
    """ Prepare a map with keywords as keys and empty lists as values.

    @param dict_words - list of keywords"""

    # Put an empty list value for each keyword as key.
    concordances = {}
    for word in dict_words:
        concordances[word] = []

    return concordances

def matches(word_a, word_b):
    """ Case insensitive string equivalence.

    @param word_a - first string
    @param word_b - second string (duh)
    @return True or False"""

    return word_a.lower() == word_b.lower()

def form_concordance(words, occurrence, context_size):
    """ Creates a concordance.

    @param words - list of all input words
    @param occurrence - index of keyword in input list
    @param context_size - number of preceding and following words
    @return a sublist of the input words"""

    # Clamp the start, so that a negative index doesn't wrap around.
    start = occurrence - context_size
    if start < 0:
        start = 0

    return words[start : occurrence + context_size + 1]

def read_stdin():
    """ Read everything from standard input as a list.

    @return list of strings"""

    words = []
    for line in sys.stdin:
        # Add all elements returned by function to words.
        words.extend(line.split())

    return words

def read_option(key, options, default):
    """ Get an option from a list of (option, value) pairs, or use a default.

    @param key - option key
    @param options - option list, as returned by getopt
    @param default - default value, used if the list does not contain that key
    @return value from the list or default"""

    for option, value in options:
        if option == SWITCH + key:
            return value

    return default

def get_configuration(arguments):
    """ Retrieve the entire configuration of the script.

    @param arguments - script runtime parameters
    @return map of options with defaults included
    @return list of arguments (keywords)
    @return list of words from standard input"""

    # All possible option sigils are concatenated into an option string.
    option_string = EMPTY.join([("%s" + COLON) % i for i in OPTIONS.keys()])
    # Read all the options.
    options, arguments = getopt.getopt(arguments, option_string)

    # Apply default values if no values were set.
    fixed_options = {}
    for key in OPTIONS.keys():
        fixed_options[key] = read_option(key, options, OPTIONS[key])

    # Read the list of words at standard input.
    words = read_stdin()

    return (fixed_options, arguments, words)

def process(options, arguments, words):
    """ The main function.

    @param options - map of options with defaults included
    @param arguments - list of arguments (keywords)
    @param words - list of words from standard input"""

    # Extract some key option values.
    context_size = int(options[CONTEXT_SIZE])
    simple = options[DISPLAY] == OPTIONS[DISPLAY]

    # Conduct main processing - find the concordances.
    concordances = {}
    if arguments == []:
        # If no arguments are specified, construct a concordance for all
        # possible keywords.
        concordances = find_all_concordances(words, context_size)
    else:
        # And if there are, make a concordance for only those words.
        concordances = find_concordances(arguments, words, context_size)

    # Display the results.
    print_concordances(concordances, simple, options[PREFIX], options[SUFFIX])

# The processing starts here.
if __name__ == '__main__':
    # Read all user-supplied information.
    options, arguments, words = get_configuration(sys.argv[1:])

    # The configuration is not full - display usage information.
    if arguments == [] and words == []:
        display_help(sys.argv[0])
        sys.exit(1)

    # If everything is in order, start concordancing.
    process(options, arguments, words)


The code is also available at GitHub as python/concordancer.py.

Zentube

Another variation on the theme of zenity. I honestly like the way it lets you make simple front-ends. In addition, I'm doing something with youtube again, or more precisely, I'm doing stuff with youtube-dl.

So, the problem with youtube is that if you don't have Internet access, you obviously can't really use it, and there are certain instances where it'd come in handy. One such instance is when you're doing language teaching in an Internet-bereft classroom.

So there's youtube-dl to get some videos downloaded, but a person is not always in the mood for fiddling with the shell when preparing their lesson material.

Hence, this script provides the simplest of interfaces to download videos via youtube-dl. That's pretty much it. Anyway, I think it's simple and does its job.

Oh, yeah, I played around with the idea of automatically installing a package if it is not available at the time of execution. It's a sort of experiment, to see if it can be done at all. I'm not sure how effective this is though. And it depends on apt and gksudo.
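The check itself is nothing fancy: the script asks which whether the command is on the PATH. For comparison, the same test could be sketched in Python with shutil.which from the standard library; the function name is_installed is made up here, and the install step is deliberately left out.

```python
import shutil

def is_installed(command):
    """Return True if command is found on the PATH.

    Hypothetical helper for illustration; zentube itself uses `which` in bash.
    """
    return shutil.which(command) is not None
```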

The code:
 
#!/bin/bash
#
# Zentube
#
# A simple GUI front-end to youtube-dl. All you need to do is run it,
# and put in the address of the video, and the back-end tries to
# download the video.
#
# Parameters:
#   None
#
# Requires:
#   youtube-dl
#   zenity
#   gksudo & apt (if you want youtube-dl installed automatically)
#
# Author:
#   Konrad Siek <konrad.siek@gmail.com>
#
# License:
#
# Copyright 2008 Konrad Siek.
#
# This program is free software: you can redistribute it and/
# or modify it under the terms of the GNU General Public
# License as published by the Free Software Foundation, either
# version 3 of the License, or (at your option) any later
# version.
#
# This program is distributed in the hope that it will be
# useful, but WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
# PURPOSE. See the GNU General Public License for more
# details.
#
# You should have received a copy of the GNU General Public
# License along with this program. If not, see
# <http://www.gnu.org/licenses/>.

# The downloader backend.
PACKAGE=youtube-dl

# Output information.
OUTPUT_DIR=~/Videos/
EXTENSION=.flv
TEMP_FILE="/tmp/$(basename "$0").XXXXXXXXXX"

# The quality of the output file can be adjusted here, or you can comment
# out this setting altogether, to get the default.
QUALITY=--best-quality

# Exit code constants.
SUCCESS=0
INSTALLATION_ABORTED=1
INSTALLATION_FAILED=2
INVALID_VIDEO_ADDRESS=4
INVALID_OUTPUT_DIRECTORY=8
BACKEND_ERROR=16

# This is a convenience installer for apt-using distros, e.g. Ubuntu.
if [ -z "$(which $PACKAGE)" ]
then
    # Ask whether to attempt automatic install of the missing package.
    # If the answer is no, then quit with an error.
    zenity --question \
        --title="Automatic installation" \
        --text="Can't find <b>$PACKAGE</b>. Should I try installing it?" \
        || exit $INSTALLATION_ABORTED

    # Try installing the missing package, or quit with an error if the
    # attempt fails.
    gksudo "apt-get install $PACKAGE" || exit $INSTALLATION_FAILED
fi

# Ask user for the URL of the video.
url=$(\
    zenity --entry \
        --title="Video address" \
        --text="What is the address of the video?" \
)
# If no URL is given, then quit.
[ -z "$url" ] && exit $INVALID_VIDEO_ADDRESS

# Move to the output directory, create it if necessary.
mkdir -p "$OUTPUT_DIR" || exit $INVALID_OUTPUT_DIRECTORY
cd "$OUTPUT_DIR" || exit $INVALID_OUTPUT_DIRECTORY

# Make a temporary file to collect error messages from the downloader.
temp_file=$(mktemp "$TEMP_FILE")

# Run the downloader.
$PACKAGE $QUALITY --title "$url" 2>"$temp_file" | \
    zenity --progress --pulsate --auto-kill --auto-close --text="Downloading..."

# Check for errors, and display a success or error dialog at the end.
errors=$(cat "$temp_file")

if [ -z "$errors" ]
then
    # Display successful info.
    zenity --info --text="Download successful!"

    # Remove temporary file.
    unlink "$temp_file"

    # Exit successfully.
    exit $SUCCESS
else
    # Display error dialog.
    zenity --error --text="$errors"

    # Remove temporary file.
    unlink "$temp_file"

    # Exit with an error code.
    exit $BACKEND_ERROR
fi


The code is also available at GitHub as bash/zentube.

Wednesday, September 16, 2009

Read selection

And onto further adventures!

After making that zenspeak script I was told that it'd be more useful if you could enter a whole lot of text into it. So then it dawned on me that maybe having a Gedit plugin like that could be useful.

So, if you've got the External Tools plug-in installed in Gedit, you can put this script in there somewhere, tweak it a bit to use your favorite speech generator, voice, etc., and you're all set to never read a single word again.

One drawback: it won't be easy to stop it if you've let it run, so if you make it read a lot of text, it might be troublesome.

I suppose I don't have it in me to really write stuff here today.

The code:
 
#!/bin/bash
# Read the selected text from standard input.
text=`xargs -0 echo`

# The speech system to use: espeak or festival.
SYSTEM=espeak

if [ -n "$text" ]
then
    echo "Reading \"$text\" with $SYSTEM."

    if [ "$SYSTEM" == espeak ]
    then
        padsp espeak "$text" -v en+f3
    elif [ "$SYSTEM" == festival ]
    then
        echo "$text" | festival --tts
    fi
fi


The code is also available at GitHub as gedit/gedit_read_selection.

Zenspeak

You can give this one to children. It makes them noisier.

This one's just a simple interface to either espeak or festival: it asks you what to say via zenity and then says it. It doesn't need any arguments, so you start it up with a simple:

./zenspeak

In summary, it's not exactly dragon magic.

The code:
 
#!/bin/bash
#
# Zenspeak
#
# Provides a simple graphical (Gtk) interface to a speech production system:
# either espeak or festival. It's really simple too: you put in some text,
# the text is spoken. When you put in zero text, the program ends.
#
# Parameters:
#   (optional) the speech system to use: espeak (default) or festival
#
# Depends:
#   espeak (apt:espeak)
#   festival (apt:festival)
#   zenity (apt:zenity)
#
# Author:
#   Konrad Siek <konrad.siek@gmail.com>
#
# License information:
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Copyright 2009 Konrad Siek

# System for production of sound is selected by the parameter,
# or the default is used if none was specified.
SYSTEM_DEFAULT=espeak
SYSTEM=${1:-$SYSTEM_DEFAULT}
echo "$SYSTEM"

# System dependent settings for espeak:
espeak_speed=120        # default: 160
espeak_pitch=60         # 0-99, default: 50
espeak_amplitude=20     # 0-20, default: 10
espeak_voice=english    # list of voices: `espeak --voices`
espeak_variant=f2       # m{1,6}, f{1,4}

# I'm completely pants when it comes to setting up festival, so I won't
# even attempt it here.

while true
do
    # Show dialog and get user input.
    something=`zenity --entry --title="Say something..." --text="Say:"`

    # If no user input or cancel: bugger off (and indicate correct result).
    if [ -z "$something" ]
    then
        exit 0
    fi

    # Put the input through either espeak or festival.
    if [ "$SYSTEM" == "espeak" ]
    then
        # Note: the sound is padded within pulse, so that it can be
        # played simultaneously with other sources.
        padsp espeak \
            -a $espeak_amplitude \
            -p $espeak_pitch \
            -s $espeak_speed \
            -v $espeak_voice+$espeak_variant \
            "$something"
    elif [ "$SYSTEM" == "festival" ]
    then
        # Incidentally, that's all I know about festival.
        echo "$something" | festival --tts
    fi
done


The code is also available at GitHub as bash/zenspeak.