Monday, November 10, 2008

Word count

Hello. Long time, no see.

Well, I had no script ideas and I've been busy. Anyway, here's a short script for people enjoying the challenge of NaNoWriMo.

There are, of course, a couple of ways to count words. If you're using some sort of office suite, it's probably built in, so no problem. If you're using LaTeX like me (because I have a LaTeX fetish), your editor might have it too, but it's less likely.

But you want to count words anyway, so what do you do?

Well, first of all, there's the Linux wc command, which handles the counting itself just fine. To strip the LaTeX markup out of the file beforehand, you can use the untex tool, which has a bunch of options for tuning how TeX tags get removed. You just feed it the .tex files and save the output somewhere...
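For a single file, the whole exercise fits in one pipeline. A minimal sketch (chapter-01.tex and plain.txt are just placeholder names, and I'm only using the -e switch because it's the one the full script below relies on; see man untex for the rest):

untex -e chapter-01.tex | wc -w        # strip the markup, count what's left
untex -e chapter-01.tex > plain.txt    # or save the stripped text to look at it first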

So, most of what I did was to put it all together, like so:
 
1 #!/bin/bash
2 output=raw.txt
3 rm -f "$output"                      # start with a clean slate
4 for file in chapter-*.tex
5 do
6     untex -e "$file" >> "$output"    # strip the LaTeX markup, append the plain text
7 done
8 echo -e "Word count:\n wc\t$(wc -w < "$output")\n awk\t$(./count.awk "$output")"


I set it up so it only reads files whose names start with 'chapter-' and end with '.tex', because that's just the structure I use. However, any other naming convention can easily be accommodated by changing the file pattern in line 4.
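For instance (a made-up naming scheme, just for illustration), if your chapters were called part-01.tex, part-02.tex and so on, line 4 would simply become:

for file in part-*.tex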

Additionally, it produces a raw.txt file as a side effect, containing the actual text that got counted, so if you want to verify untex or either of the counting mechanisms, you can do that easily.
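A quick sanity check then boils down to poking at that file directly:

less raw.txt            # eyeball what untex actually left in
wc -w raw.txt           # the plain wc count
./count.awk raw.txt     # the AWK count, for comparison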

Also, if you look closely, you will see that there's something extra in line 8: I call an AWK script to provide a second word count. Here's what the script looks like inside:

#!/usr/bin/awk -f
{
    for (i = 1; i <= NF; i++) {
        word = $i;
        # the characters between the square brackets below count as punctuation;
        # add or remove them here to tune the splitting
        n = split(word, a, /[-,.?!~`';:"|\/@#$%^&*_+={}\[\]<>()]+/);
        for (j = 1; j <= n; j++) {
            # count only the pieces that aren't empty or pure whitespace
            if (a[j] !~ /^[ \t\n]*$/) {
                words++;
            }
        }
    }
}

BEGIN {
    words = 0;
}

END {
    print words;
}


What it actually does is count the words, but unlike the wc command it tries to recognize punctuation and splits words on it as well, so hyphenated words count as separate words. It also picks up things like long hyphens (LaTeX: '--') and removes them, so they are no longer counted as words.
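To see how the two counters differ, here's a quick comparison (the sample strings are made up just for illustration, and the exact numbers assume the punctuation class shown above):

echo "mother-in-law" | wc -w          # prints 1
echo "mother-in-law" | ./count.awk    # prints 3: mother, in, law
echo "wait -- what" | wc -w           # prints 3: wc counts the '--' as a word
echo "wait -- what" | ./count.awk     # prints 2: the '--' is thrown away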

I don't know which one is more accurate, but between the two I always have an optimistic and a pessimistic estimate of how many words I wrote.

The code is also available at GitHub as awk/count.awk
and bash/word_count.

3 comments:

Anonymous said...

Oh my. I've come across a problem in the count.awk file. Here's the error message I get when using the make_raw:

"
./count.awk: line 3: syntax error near unexpected token `('
./count.awk: line 3: ` for (i = 1; i <= NF; i++) {'
"

Any idea, guv? Am I doing something stupid?

Kondziu said...

This is strange, it seems to be working for me.

Let's see now...

Did you leave the top comment (the '#!/usr/bin/awk -f' line) in? If not, the system doesn't know to use AWK as the interpreter and will try to use bash instead. Also, the line 3 it indicates seems to be my line 4.
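If the first line does go missing, you can also sidestep the problem by calling the interpreter explicitly:

awk -f count.awk raw.txt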

Anonymous said...

Indeed, as you very well know, it turned out the first line got butchered in the copy-paste operation. So the script itself is in fact fabulous and proper.

So I was, in fact, doing something stupid.