Friday, September 18, 2009

Concordancer

Concordancers are a sort of pet project for me - I'm often in the process of making one. They're simple enough to be fun, and complex enough to be interesting.

An additional perk is that nobody knows what concordancers are. If you want to know about them, you probably want to start here and dig on.

This particular instance is meant to be very simple, and leaves any pre- and post-processing to the user. The text is split into words on whitespace alone, for instance, so all the punctuation sticks to the words and distorts the result. If you want that fixed, you have to do it before passing the text on to the concordancer.
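Here is a minimal sketch of what such a pre-processing step might look like, assuming all you want is to trim punctuation from the edges of each word. The helper name strip_punctuation is made up for illustration and is not part of the concordancer, and the sketch assumes Python 3.

```python
import string

def strip_punctuation(line):
    """Split a line on whitespace and trim punctuation from both ends of each word."""
    trimmed = (word.strip(string.punctuation) for word in line.split())
    # Drop tokens that consisted of punctuation only.
    return [word for word in trimmed if word]

# A line from the sample output further down, cleaned up word by word.
sample = 'rest, perish either by the sword or by famine." XXXI.--They rise'
print(" ".join(strip_punctuation(sample)))
```

Note that punctuation inside a token, as in XXXI.--They, is left alone by this approach; only the edges are trimmed.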

Ok, so here's a quick walkthrough.

Basically, the concordancer reads text from stdin and the keyword list from the arguments. Here's an example of searching for the word Sword in the text file de-bello-gallico.txt:

cat de-bello-gallico.txt | ./concordancer.py Sword

That should output something like:

then much reduced by the *sword* of the enemy, and by
rest, perish either by the *sword* or by famine." XXXI.--They rise
rushes on briskly with his *sword* and carries on the combat
Therefore, having put to the *sword* the garrison of Noviodunum and
had advanced up the hill *sword* in hand, and had forced
labour, should put to the *sword* all the grown-up inhabitants, as
made a blow with his *sword* at his naked shoulder and
by the wound of a *sword* in the mouth; nor was

Actually, an even simpler invocation is available, if you want to create concordances for all the words in the text - in that case, you needn't provide a list of keywords, and can just go:

cat de-bello-gallico.txt | ./concordancer.py

... but I'm not sure how useful that'll be to you.

Typically, you'd probably want to find concordances for a word in all its forms. You can do that using aspell to generate the list of keywords from a given root:

cat de-bello-gallico.txt | ./concordancer.py `aspell dump master | grep sword`

And that'll produce output for all words containing the substring sword.

Now, there are probably better ways to use aspell for that purpose, but honestly, I played around with it and this is the only one that got me the result I wanted...
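If you'd rather do the filtering in Python, here is a rough sketch of the same idea. It assumes you already have the word list as a list of strings (from aspell or anywhere else), and select_keywords is a hypothetical helper, not something the script provides. It lowercases the results because the script compares words case-insensitively, so keywords that differ only in case would each collect the same occurrences.

```python
def select_keywords(word_list, root):
    """Return sorted, de-duplicated lowercase words that contain root."""
    root = root.lower()
    # A set removes duplicates, including words that differ only in case.
    return sorted({word.lower() for word in word_list if root in word.lower()})

# A made-up word list standing in for a real dictionary dump.
print(select_keywords(["sword", "Sword", "swords", "word", "swordfish"], "sword"))
```

The resulting list can then be passed to the concordancer as its keyword arguments.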

You can play around with different formats too, by just converting them to text prior to creating the concordance. For instance, for PDFs, use pdftotext:

pdftotext de-bello-gallico.pdf - | ./concordancer.py `aspell dump master | grep sword`

Right. You can also play around with the output of the concordancer. By default it marks the keywords in concordances with asterisks, but you can change that, to e.g. some HTML tags, by going:

cat de-bello-gallico.txt | ./concordancer.py -p '<b>' -s '</b>' Sword

And that'll produce something like:

then much reduced by the <b>sword</b> of the enemy, and by
rest, perish either by the <b>sword</b> or by famine." XXXI.--They rise
...
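One thing to watch out for with HTML output: the concordancer does not escape the text itself, so any &, < or > in the input ends up in the markup as-is. Below is a small sketch of how marking and escaping could be combined, assuming Python 3 and its standard html module. The function mark_html is invented for this example and is not part of the script.

```python
import html

def mark_html(words, keyword, prefix="<b>", suffix="</b>"):
    """Join words into one line, escaping each word and wrapping keyword matches."""
    marked = []
    for word in words:
        # Escape &, < and > so the text cannot be mistaken for markup.
        safe = html.escape(word, quote=False)
        # Same case-insensitive comparison the script uses for matching.
        if word.lower() == keyword.lower():
            safe = prefix + safe + suffix
        marked.append(safe)
    return " ".join(marked)

print(mark_html(["cut", "&", "thrust", "with", "his", "sword"], "Sword"))
```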

Another thing you can do is change the size of the context, here to up to 3 words on each side:

cat de-bello-gallico.txt | ./concordancer.py -c 3 Sword 

That will output:

reduced by the *sword* of the enemy,
either by the *sword* or by famine."
briskly with his *sword* and carries on
put to the *sword* the garrison of
up the hill *sword* in hand, and
put to the *sword* all the grown-up
blow with his *sword* at his naked
wound of a *sword* in the mouth;
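The "up to" matters: a keyword close to the beginning or end of the text simply gets fewer neighbours. The sketch below shows the windowing as a plain list slice, along the lines of form_concordance in the script; context_window is just an illustrative name, and the sample sentence is taken from the output above.

```python
def context_window(words, index, context_size):
    """Return the word at index with up to context_size words on each side."""
    # Clamp the start at 0: a negative start would count from the end of
    # the list and give the wrong slice for keywords near the beginning.
    start = max(0, index - context_size)
    # Slicing past the end of a list is safe in Python, so the end needs no clamping.
    return words[start:index + context_size + 1]

words = "had advanced up the hill sword in hand, and had forced".split()
print(" ".join(context_window(words, words.index("sword"), 3)))
```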

Also, you can group the output by keywords:

cat de-bello-gallico.txt | ./concordancer.py -d group reserves declares

And that gives you something like this:

*reserves*:
	neither could proper *reserves* be posted, nor
*declares*:
	suddenly assaulted; he *declares* himself ready to
	that council he *declares* Cingetorix, the leader
	Hispania Baetica, _Carmone_; *declares* for Caesar, and
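Under the hood, the grouped display boils down to a map from each keyword to a list of concordances, printed as a header line followed by tab-indented lines. Here is a rough sketch of that formatting step, assuming the map has already been built. The function format_grouped is made up for illustration and returns the lines instead of printing them.

```python
def format_grouped(groups, prefix="*", suffix="*"):
    """Turn a keyword -> list-of-concordances map into display lines."""
    lines = []
    for keyword, concordances in groups.items():
        # Header line: the marked keyword followed by a colon.
        lines.append(prefix + keyword + suffix + ":")
        for words in concordances:
            marked = [
                prefix + word + suffix if word.lower() == keyword.lower() else word
                for word in words
            ]
            # Each concordance is indented with a tab under its keyword.
            lines.append("\t" + " ".join(marked))
    return lines

groups = {"reserves": [["could", "proper", "reserves", "be", "posted,"]]}
print("\n".join(format_grouped(groups)))
```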

Enough rambling, here's the code:
 
#!/usr/bin/python
#
# Concordancer
#
# A script for finding concordances for given keywords in the
# specified text.
#
# A concordance is a keyword with its context (here, the closest
# n words), a combination used, for instance, in lexicography to
# deduce the meaning of the keyword based on the way it is used
# in text.
#
# Parameters:
#   c - the number of words that surround a keyword in context
#   p - the string that is stuck in front of keywords
#   s - the string that is stuck at the ends of keywords
#   d - formatting of the display,
#       'simple' - one concordance per line (default)
#       'group' - group concordances by keywords
#
# Example:
#   to find concordances for the words 'arguments' and 'options'
#   in the bash manual:
#       man bash | concordancer.py arguments options
#
# Author:
#   Konrad Siek <konrad.siek@gmail.com>
#
# License:
#
# Copyright 2009 Konrad Siek
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Imports.
import getopt
import sys

# Option sigils - the characters associated with various options.
CONTEXT_SIZE = 'c'
PREFIX = 'p'
SUFFIX = 's'
DISPLAY = 'd'

# Option default values, represented as a map for convenience.
OPTIONS = {
    CONTEXT_SIZE: '5',
    PREFIX: '*',
    SUFFIX: '*',
    DISPLAY: 'simple',
}

# Character constants, also for convenience.
EMPTY = ""
SPACE = " "
NEWLINE = "\n"
TAB = "\t"
COLON = ":"
SWITCH = "-"

def display_help(program_name):
    """ Display usage information.

    @param program_name - the name of the script"""

    help_string = """Usage:
    %s [OPTION] ... [WORD] ...
Options:
    -%s  the number of words that surround a keyword in context
    -%s  the string that is stuck in front of keywords
    -%s  the string that is stuck at the ends of keywords
    -%s  formatting of the display,
            'simple' - one concordance per line (default)
            'group' - group concordances by keywords
Words:
    The list of words that concordances will be searched for. If
    no list is provided, a complete concordance is made - that is,
    one using all input words.""" \
    % (program_name, CONTEXT_SIZE, PREFIX, SUFFIX, DISPLAY)
    print(help_string)

def find_concordances(keywords, words, context_size):
    """ Finds concordances for keywords in a list of input words.

    @param keywords - list of keywords,
    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of keywords to lists of concordances"""

    # Initialize the concordance map with empty lists, for each keyword.
    concordances = prep_concordance_map(keywords)

    # If any word in the text matches a keyword, create a concordance.
    for i in range(0, len(words)):
        for keyword in keywords:
            if matches(keyword, words[i]):
                concordance = form_concordance(words, i, context_size)
                concordances[keyword].append(concordance)

    return concordances

def find_all_concordances(words, context_size):
    """ Make a complete concordance - assume all words match.

    @param words - input text as a list of words
    @param context_size - number of words that should surround a keyword
    @return map of keywords to lists of concordances"""

    concordances = {}

    for i in range(0, len(words)):
        word = words[i]
        if word not in concordances:
            concordances[word] = []
        concordance = form_concordance(words, i, context_size)
        concordances[word].append(concordance)

    return concordances

def print_concordances(concordances, simple, prefix, suffix):
    """ Print the concordances to screen.

    @param concordances - map of keywords to lists of concordances
    @param simple - True: display only concordances, False: group by keywords
    @param prefix - prefix to keywords
    @param suffix - suffix to keywords"""

    # For each concordance, mark the keywords in the sentence and print it out.
    for keyword in concordances:
        if not simple:
            sys.stdout.write(prefix + keyword + suffix + COLON + NEWLINE)
        for words in concordances[keyword]:
            if not simple:
                sys.stdout.write(TAB)
            for i in range(0, len(words)):
                if matches(keyword, words[i]):
                    sys.stdout.write(prefix + words[i] + suffix)
                else:
                    sys.stdout.write(words[i])
                if i < len(words) - 1:
                    sys.stdout.write(SPACE)
                else:
                    sys.stdout.write(NEWLINE)

def prep_concordance_map(dict_words):
    """ Prepare a map with keywords as keys and empty lists as values.

    @param dict_words - list of keywords"""

    # Put an empty list value for each keyword as key.
    concordances = {}
    for word in dict_words:
        concordances[word] = []

    return concordances

def matches(word_a, word_b):
    """ Case insensitive string equivalence.

    @param word_a - first string
    @param word_b - second string (duh)
    @return True or False"""

    return word_a.lower() == word_b.lower()

def form_concordance(words, occurance, context_size):
    """ Creates a concordance.

    @param words - list of all input words
    @param occurance - index of keyword in input list
    @param context_size - number of preceding and following words
    @return a sublist of the input words"""

    # Clamp the start of the window at the beginning of the text.
    start = occurance - context_size
    if start < 0:
        start = 0

    return words[start : occurance + context_size + 1]

def read_stdin():
    """ Read everything from standard input as a list.

    @return list of strings"""

    words = []
    for line in sys.stdin:
        # Add all elements returned by function to words.
        words.extend(line.split())

    return words

def read_option(key, options, default):
    """ Get an option from a map, or use a default.

    @param key - option key
    @param options - list of (option, value) pairs returned by getopt
    @param default - default value, used if the list does not contain that key
    @return value from the list or default"""

    for option, value in options:
        if option == SWITCH + key:
            return value

    return default

def get_configuration(arguments):
    """ Retrieve the entire configuration of the script.

    @param arguments - script runtime parameters
    @return map of options with defaults included
    @return list of arguments (keywords)
    @return list of words from standard input"""

    # All possible option sigils are concatenated into an option string.
    option_string = EMPTY.join([("%s" + COLON) % i for i in OPTIONS.keys()])
    # Read all the options.
    options, arguments = getopt.getopt(arguments, option_string)

    # Apply default values if no values were set.
    fixed_options = {}
    for key in OPTIONS.keys():
        fixed_options[key] = read_option(key, options, OPTIONS[key])

    # Read the list of words at standard input.
    input = read_stdin()

    return (fixed_options, arguments, input)

def process(options, arguments, input):
    """ The main function.

    @param options - map of options with defaults included
    @param arguments - list of arguments (keywords)
    @param input - list of words from standard input"""

    # Extract some key option values.
    context_size = int(options[CONTEXT_SIZE])
    simple = options[DISPLAY] == OPTIONS[DISPLAY]

    # Conduct main processing - find the concordances.
    concordances = {}
    if not arguments:
        # If no arguments are specified, construct a concordance for all
        # possible keywords.
        concordances = find_all_concordances(input, context_size)
    else:
        # And if there are, make a concordance for only those words.
        concordances = find_concordances(arguments, input, context_size)

    # Display the results.
    print_concordances(concordances, simple, options[PREFIX], options[SUFFIX])

# The processing starts here.
if __name__ == '__main__':
    # Read all user-supplied information.
    options, arguments, input = get_configuration(sys.argv[1:])

    # The configuration is not full - display usage information.
    if not arguments and not input:
        display_help(sys.argv[0])
        sys.exit(1)

    # If everything is in order, start concordancing.
    process(options, arguments, input)


The code is also available at GitHub as python/concordancer.py.
