The Awk Programming Language

Posted by: Rea Maor In: Programming - Sunday, May 6th, 2007

Let’s settle some questions about Awk right up front:

  • The funny name is the last-name initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.
  • Its syntax comes from C, and awk was a major inspiration for Perl.
  • It is not really a general-purpose programming language. That is, I wouldn’t try to build a large application with it. Like sed (which we won’t cover), it was designed to be a small, efficient tool for manipulating text streams. It can do most of what other programming languages can do (such as defining functions), but it shines brightest when it’s called on occasionally to make a quick filter.
  • Like Lisp, it suffers from the multiple-version problem. There’s gawk (GNU awk), mawk (which uses a byte-code interpreter to improve speed), jawk (an awk done in Java), nawk (“new” awk), and a few others, including a component in the BusyBox mini-utility. Each one has its own extensions and quirks, which annoys Unix users who may get a softlink named “awk” pointing to a different awk than the one they expect, so that it quietly tries to execute the scripts you ported from your old computer and fails in non-obvious ways.
  • Awk (at least in some form) is cross-platform. It is available for Windows through DJGPP and various other ports, it is a standard tool in any Unix-based distribution, and it is a part of Darwin/OS X for the Mac.

Awk’s little gift to the world is in being the ideal tool for writing “one-liners”. Not programs per se, but simply commands you can twiddle out at the command line to do nifty things when you don’t have another tool to do the job.

An example: say Firefox froze up or crashed (it happens). From the command line, you’ll get several process IDs for Firefox because it can run as multiple processes at the same time. To kill them all in one stroke:

for NUMBER in $(ps aux | grep Firefox | awk '{print $2}'); do kill $NUMBER; done

As both Unix and Mac command-line users know, this is a classic pipeline. It starts with a Bash ‘for’ loop (for X in Y; do command; done), and the target of the for loop is a three-move combo (‘ps aux’ prints the currently running processes, ‘grep’ is told to print only those lines with ‘Firefox’ in them, and the awk call ‘print $2’ tells it to print only the second field of each of those lines, isolating the process ID so the kill command can reap them). And you thought only Mortal Kombat had combo moves! The “|” is the pipe character, which takes the output of the previous command and sends it to the next command as input.
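
One wrinkle with this combo is that the ‘grep Firefox’ command can itself show up in the ‘ps aux’ listing, since its own command line contains the word. A minimal sketch of one common workaround, run here on made-up sample lines rather than real ps output: let awk do the matching and write the pattern as [F]irefox, so the pattern text does not match itself. The function name firefox_pids is hypothetical.

```shell
# Hypothetical helper: read "ps aux"-style lines on stdin and print the
# second field (the process ID) of every line that mentions Firefox.
# The [F] bracket keeps this command's own text from matching the pattern.
firefox_pids() {
    awk '/[F]irefox/ { print $2 }'
}

# Made-up sample lines standing in for real "ps aux" output.
printf '%s\n' \
    'rea  1234  0.5  2.1  /usr/lib/Firefox/firefox-bin' \
    'rea  1240  0.0  0.1  bash' \
    'rea  1302  0.3  1.8  /usr/lib/Firefox/plugin-helper' |
    firefox_pids
# prints:
# 1234
# 1302
```

On a live system the same function would be fed with ‘ps aux | firefox_pids’.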

Thus, those fantastically complex command lines you see floating around are actually a series of smaller commands, connected together like a set of Legos to build whatever special-purpose tool you need. Some people are turned off, but a special kind of user is quite drawn to them; knowing how to bat one of these out when needed feels a little bit like knowing a magic spell! You could even make an ‘alias’ command and stick it in your Bash profile, and thereafter simply call it by the name you gave it:

alias killfox='for NUMBER in $(ps aux | grep Firefox | awk "{print \$2}"); do kill $NUMBER; done'

which would be ‘killfox’ in this case. Next, you could enter your custom ‘killfox’ command as an item in your desktop menu, reducing what initially was a tedious problem down to a single mouse-click.

While awk does have a nearly-full capability as a programming language, there are some features lacking (such as being able to include libraries) which prevent it from being a good choice as a general-purpose language. Instead, let’s just list a little “awk spell book”:

cat ~/myreport | awk -F: '{print $1}'

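If you need to change the field separator to something besides whitespace (the default, which is what the ps example relied on), the -F option followed by the needed character does it. Here the separator is a colon, so $1 is everything before the first colon on each line.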
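
To see -F at work without a report file handy, here is a small sketch on made-up colon-separated records in the style of /etc/passwd. The function name first_field and the sample data are assumptions for illustration only.

```shell
# Hypothetical helper: print the first colon-separated field of each line.
first_field() {
    awk -F: '{ print $1 }'
}

# Made-up colon-separated records.
printf '%s\n' 'root:x:0:0' 'daemon:x:1:1' | first_field
# prints:
# root
# daemon
```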

cat | awk '/echo/ {$0 = "#" $0}; {print $0}' | cat >

Let’s say you had a Bash script which you had printing state for debugging purposes, but now you’re ready to stop the unneeded printing. This will pipe the Bash script through the awk program, which is made of two “pattern {action}” rules. The first rule fires only on lines that contain ‘echo’ and prepends a “#” to the line; the second rule has no pattern, so it runs on every line and prints it, changed or not. The last connector in the pipe will save the output in a new file. Neat, no?
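
Here is a runnable sketch of the same two-rule program, fed with a made-up three-line script instead of a real file. The function name comment_out_echo is hypothetical. Keep in mind that the pattern matches ‘echo’ anywhere on a line, so a line that merely mentions the word gets commented out as well.

```shell
# Hypothetical helper: comment out every line that contains "echo".
# Rule 1 (pattern /echo/) rewrites the matching line in place.
# Rule 2 (no pattern) prints every line, rewritten or not.
comment_out_echo() {
    awk '/echo/ { $0 = "#" $0 } { print $0 }'
}

# Made-up script lines for illustration.
printf '%s\n' 'x=1' 'echo "x is $x"' 'x=2' | comment_out_echo
# prints:
# x=1
# #echo "x is $x"
# x=2
```

With real files it would be called along the lines of ‘comment_out_echo < old_script.sh > new_script.sh’, where both file names are placeholders.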

cat file | awk 'BEGIN {lines = 0}; {lines++}; END {print lines}'

A line-count. There’s no explicit loop here, because awk supplies one for you: the BEGIN rule runs once before any input is read and initializes the ‘lines’ variable to zero, the middle rule runs once for every input line (lines++ does just what you C programmers would expect it to do), and the END rule runs once after the file’s end is reached, printing the total. Note also that we use semi-colons to separate the rules, just like statements in C.
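
The order in which awk runs the three kinds of rules is easier to see when each one prints something. A small sketch, assuming two made-up input lines; the function name show_order is hypothetical.

```shell
# Hypothetical helper: print a message from each kind of rule so the
# order of execution is visible.
show_order() {
    awk 'BEGIN { print "before input" }
         { lines++; print "line " lines ": " $0 }
         END { print "after input, lines = " (lines + 0) }'
}

printf '%s\n' alpha beta | show_order
# prints:
# before input
# line 1: alpha
# line 2: beta
# after input, lines = 2
```

The ‘lines + 0’ in the END rule forces a numeric result, so an empty input reports 0 rather than an empty string.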

cat number_data | awk 'BEGIN {total=0}; {total=total+$1}; END {print total}'

Got a column of numbers in a file that you need the sum of? This uses the same trick as above to do it, adding the first field of each line to a running total.
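
A sketch of the same sum on made-up numbers, written a little more compactly: awk variables start out empty (which counts as zero in arithmetic), so the BEGIN rule is optional, and ‘+=’ works as it does in C. The function name sum_column is hypothetical.

```shell
# Hypothetical helper: add up the first field of every input line.
sum_column() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Made-up column of numbers.
printf '%s\n' 10 20 12.5 | sum_column
# prints:
# 42.5
```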

awk 'BEGIN {print 2^8}'

Yep, you get 256 (2 to the 8th power)! It’s a command-line calculator, too! Note that we gave awk no input file; a program that consists only of a ‘BEGIN’ rule runs without reading any input at all.
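
If you use the calculator trick often, it can be wrapped in a tiny shell function. This is only a sketch: the name calc is hypothetical, and because the argument is pasted straight into the awk program, it should only ever be given expressions you typed yourself.

```shell
# Hypothetical helper: evaluate one arithmetic expression with awk.
# The shell substitutes the first argument into the program text.
calc() {
    awk "BEGIN { print ($1) }"
}

calc '2^8'
calc '(3 + 4) * 5'
# prints:
# 256
# 35
```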

cat accesslog_website_todays_date | awk '{print $1}' | sort | uniq | wc -l

This is assuming that your site prints access logs with the IP address first. Grab the access log from your website today, pipe it through this magic spell, and get a head count of unique visitors. (‘cat’ pipes the input, awk discards everything but the first field, which is the IP address, ‘sort’ puts them in order, ‘uniq’ eliminates adjacent duplicates, which is why the sort has to come first, and ‘wc’ with the -l option counts the lines that are left.) An old trick known to webmasters everywhere.
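
The same head count can also be done inside awk alone, using one of its associative arrays to remember which addresses have already been seen. A minimal sketch on made-up log lines; the function name unique_visitors and the log format are assumptions.

```shell
# Hypothetical helper: count distinct values of the first field.
# seen[$1]++ is 0 (false) the first time an address appears, so the
# negated pattern is true exactly once per address.
unique_visitors() {
    awk '!seen[$1]++ { visitors++ } END { print visitors + 0 }'
}

# Made-up access-log lines with the IP address first.
printf '%s\n' \
    '10.0.0.1 - - "GET / HTTP/1.1" 200' \
    '10.0.0.2 - - "GET /about HTTP/1.1" 200' \
    '10.0.0.1 - - "POST /comment HTTP/1.1" 302' |
    unique_visitors
# prints:
# 2
```

This version needs no sorting, at the cost of holding every distinct address in memory.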

cat accesslog | awk '/POST / {print $1};' | sort | uniq -c

Now *this* little gem, when run on your site’s access log, will only print the IP addresses of the visitors who posted to your site (such as by filling in a comment form), sort them, and use ‘uniq’ with the -c option, which will count how many copies of each element it encountered. Why? Well, so you can see if one IP tried posting 190 times per day to your blog’s comment form – might this be the little devil who’s been spamming you? Perhaps trying to hack your admin password?
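
As a variation, awk can do the counting itself with an associative array, leaving ‘sort’ only the job of putting the busiest addresses on top. This is a sketch on made-up log lines; the function name posts_per_ip and the log format are assumptions.

```shell
# Hypothetical helper: count POST requests per IP address and list the
# busiest addresses first.
posts_per_ip() {
    awk '/POST / { count[$1]++ }
         END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Made-up access-log lines with the IP address first.
printf '%s\n' \
    '10.0.0.9 - - "POST /comment HTTP/1.1" 302' \
    '10.0.0.2 - - "GET / HTTP/1.1" 200' \
    '10.0.0.9 - - "POST /comment HTTP/1.1" 302' \
    '10.0.0.2 - - "POST /comment HTTP/1.1" 302' \
    '10.0.0.9 - - "POST /comment HTTP/1.1" 302' |
    posts_per_ip
# prints:
# 3 10.0.0.9
# 1 10.0.0.2
```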

Another Unix-native command, sed, is often mentioned in the same breath with awk (in fact, they get combined in the same book by O’Reilly press!), but actually sed is weaker than awk, and has a syntax that looks like a cartoon character swearing, being composed almost entirely of punctuation marks. Really, awk is the best way to go, as anything sed can do awk can do better – and clearer!

There are many more awk one-liners here.

Well, we’ve poked into some dark, murky corners of the programming world, and have looked at some stuff that isn’t interesting to a lot of people. So we’ll finish our tour with JavaScript and PHP, two languages which anybody working on the Internet should at least be familiar with!
