Awk

Awk is a programming language, at least as much as 'sh' is a programming language. It's a pattern matcher. It is good for a lot of file-munging kinds of tasks.

Awk's syntax looks a bit more like C than like the shell, although it takes a shell-ish syntactic approach in places; for example, newlines are significant. However, it does not have the shell's bizarre inverse quoting rules; strings do need to be in double-quotes. In terms of basic syntax, it is more like a normal programming language than the shell is.

On the other hand, it is still a somewhat-specialized programming language.

An awk program is implicitly a loop over the lines of its input; the program's clauses are applied to each line in turn.

{
    print
}

is like a 'cat'.

To execute this awk program, we could type: awk '{ print }'

Some of the newlines above are not actually required (and are not present when we type awk '{ print }'). We would only attempt this newline elimination for very short awk programs.

What looks like sh's positional parameters ($1 through $9, although there is no limit of 9 in awk) refers to fields on the current line. So for example,

awk '{ print $1 }'

outputs the first field of every input line (try it!).

Here is a slightly more exciting awk program. From here on I will usually write awk programs as shell scripts: that is, I will include the invocation of awk, so that we get something which we would pass to sh, not directly to awk. Sh will then invoke awk. We do this a lot, and it allows some other cute interactions between sh and awk, illustrated further below.

awk '
$1 == "foo" {
    print
}
$1 == "bar" {
    print "bar"
}'

If a boolean expression ("if ()" contents) appears before the left brace, then that clause is only executed for lines for which the boolean expression is true. Thus this awk program prints "bar" (but nothing else) for each input line whose first field is "bar", and prints the entire input line if the first field is "foo". "print" with no parameters prints the entire input line. You can also refer to the entire current input line as "$0".

print emits an implied newline. Comma-separated items in a print statement are separated in the output by the output field separator, which is a single space by default.
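To see the difference the comma makes, here is a small sketch (the sample input line is made up for illustration). With a comma, the two values come out separated by a space; without one, awk simply joins them together.

```shell
# Made-up input line with three whitespace-separated fields.
with_comma=$(echo 'a b c' | awk '{ print $1, $3 }')
without_comma=$(echo 'a b c' | awk '{ print $1 $3 }')

echo "$with_comma"       # a c
echo "$without_comma"    # ac
```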

printf (as opposed to "print") doesn't have this implied newline. The awk "printf" statement has a first parameter which is the format string, like printf in the C library, and subsequent parameters are referred to by '%' items in the format string.

awk '
$1 == "foo" {
    print
}
$1 == "bar" {
    printf "bar %d\n", 2+3
}'

Let us now write an awk program which takes input which looks like:

3 4 Sally
5 12 Alan
1 9
6 8 Fred

and produces output like:

Sally 7
Alan 17
unknown 10
Fred 14

Here is an awk program to do that:

awk '
NF == 2 {
    print "unknown", $1 + $2
}
NF != 2 {
    print $3, $1 + $2
}'

Those expressions before the list of statements in braces can also be regular expression patterns;
they can be absent to match all lines (as done at the beginning above);
they can also be BEGIN or END to specify code to be executed one time before or after all line processing.
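
As a small sketch of these three kinds of pattern together (the input lines and the messages are made up for illustration), the following prints a line before any input is read, reports each line containing "an", and prints a line after all input has been processed:

```shell
# Made-up three-line input; only "banana" contains "an".
out=$(printf 'apple\nbanana\ncherry\n' | awk '
BEGIN {
    print "start"
}
/an/ {
    print "matched:", $0
}
END {
    print "done"
}')

echo "$out"
```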

The following awk program sums a list of input numbers, one per line:

awk '
BEGIN {
    x = 0
}
{
    x += $1
}
END {
    print x
}'

As in sh, you don't declare variables; just start using them. However, since strings do require double-quotes, we don't need to have any variable introducer like sh's "$" (and indeed we don't have any such thing).

NF is a special variable which is the field count for the current line: how many fields there are.

NR is the current line number ("record").
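
For example, this sketch prints the line number and field count of each line of some made-up input. Note that the empty second line has zero fields.

```shell
# Made-up input: three fields, then an empty line, then two fields.
out=$(printf 'one two three\n\nfour five\n' | awk '
{
    print NR, NF
}')

echo "$out"
# 1 3
# 2 0
# 3 2
```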

if, for, while, and do-while are like in C, including the optional braces to have more than one statement in the body.
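
Here is a minimal sketch of a "for" and an "if" with "else"; the task and the variable names are made up for illustration. For each input number n it adds up 1 through n, and then classifies the total.

```shell
# Made-up input: one number per line.
out=$(printf '3\n5\n' | awk '
{
    total = 0
    for (i = 1; i <= $1; i++)
        total += i
    if (total > 10) {
        print $1, total, "big"
    } else {
        print $1, total, "small"
    }
}')

echo "$out"
# 3 6 small
# 5 15 big
```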

The "next" statement means that we are done processing for this line, and no other clauses are considered for execution. It's rather like C's "continue" statement. For example, instead of the second-previous program we could have:

awk '
NF == 2 {
    print "unknown", $1 + $2
    next
}
{
    print $3, $1 + $2
}'

There is also an "exit" statement. In the main loop, it skips ahead to the END clause, if any; in the END clause, it really does exit. In either case, no further lines are processed.
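
As a sketch of "exit" in the main loop (with made-up input), the following copies only the first two lines of its input. On reaching the third line it exits, which runs the END clause without reading any more lines.

```shell
# Made-up four-line input; lines c and d are never printed.
out=$(printf 'a\nb\nc\nd\n' | awk '
NR > 2 {
    exit
}
{
    print
}
END {
    print "stopped at line", NR
}')

echo "$out"
# a
# b
# stopped at line 3
```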

In awk, quite unlike C, strings are "first-class values" -- you can pass them around, you can store them in variables, they can be the value of an expression, etc. So we need a few new operators.

The string concatenation operator is simply juxtaposition. If x is "world" and we want to set y to be the string "Hello, world", we can write

y = "Hello, " x

You can compare strings for equality, just with "a == b".
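There is also a regular-expression-matching operator, "~":
"$1 ~ /A..n/" tests whether $1 contains an 'A' followed by any two characters and then an 'n', anywhere within it. To test whether $1 is exactly a four-character string whose first character is 'A' and last is 'n', anchor the pattern: "$1 ~ /^A..n$/". There's a corresponding "!~" operator.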

Other string functions:

length(s) -- returns an integer which is the length of the string. Often used in conjunction with substr():
substr(s, i, len) -- returns a portion of the string, starting at index i (i is one-origin), taking up to len characters (it's allowed for the string to end sooner than that)
index(s, t) -- returns an integer index of string "t" as a substring of string "s", first occurrence. Also one-origin. 0 for not found.
split(s, newarray, separators) -- invokes awk's tokenization algorithm on the string, possibly with a different separator than what you're using overall in this awk program, and puts it as a bunch of array members in "newarray". We didn't do an example of this; I just think you should remember (perhaps just vaguely) that it exists for possible future reference, outside this course.
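
For reference, here is a small sketch exercising the string operators and functions described above, including split(). All the strings and variable names are made up for illustration. It relies on two facts not stated above: a regular expression matches anywhere in the string unless anchored with ^ and $, and split() returns the number of pieces. Since the program has only a BEGIN clause, awk reads no input.

```shell
out=$(awk '
BEGIN {
    x = "world"
    y = "Hello, " x                  # concatenation by juxtaposition
    print y
    print length(y)                  # 12
    print substr(y, 8, 5)            # world
    print substr(y, 8, 100)          # also world: the string ends first
    print index(y, "world")          # 8
    print index(y, "xyz")            # 0, not found
    if ("XAlana" ~ /A..n/)
        print "unanchored match"
    if ("XAlana" !~ /^A..n$/)
        print "no anchored match"
    n = split("root:x:0", parts, ":")
    print n, parts[1], parts[3]      # 3 root 0
}')

echo "$out"
```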

Since strings are first-class values, the sprintf() function does not need its parameter which in C indicates a target char array in which to store its result. Instead, the format string and all '%'-invoked arguments are the only parameters to awk's sprintf() function, and it returns a value.

x = sprintf(...)
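
A minimal sketch, with a made-up format string and input line: the result of sprintf() is an ordinary string value, so we can store it in a variable, print it, or take its length.

```shell
# Made-up input line in the style of the earlier examples.
out=$(echo '3 4 Sally' | awk '
{
    x = sprintf("%s: %03d", $3, $1 + $2)
    print x
    print length(x)
}')

echo "$out"
# Sally: 007
# 10
```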

$0 is the whole line. E.g. "print" (with no arguments) is short for "print $0".

Awk also has "associative arrays" -- arrays which can be indexed by any value, such as a string, rather than only by integers. A very nifty and useful data structure.

Word frequency count, assuming one word per line and no punctuation and such (perhaps achieved with a cute 'tr' command):

awk '
{
    freq[$1]++
}
END {
    for (i in freq)
        print i, freq[i]
}'

Speaking of cute 'tr' commands, the pipeline used to demonstrate this in lecture was

tr -cs a-zA-Z \\012 | tr A-Z a-z | awk 'above stuff'
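
Here is that pipeline written out in full as a sketch, run on a made-up sentence. The final "sort" is my addition, not part of the lecture pipeline: the order in which "for (i in freq)" visits the array members is unspecified, so sorting makes the output predictable.

```shell
# Made-up input text; the two tr commands turn it into one lowercase word per line.
out=$(echo 'The cat and the dog. The end!' |
    tr -cs a-zA-Z \\012 | tr A-Z a-z | awk '
{
    freq[$1]++
}
END {
    for (i in freq)
        print i, freq[i]
}' | sort)

echo "$out"
# and 1
# cat 1
# dog 1
# end 1
# the 3
```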

Further reading: Aho, Kernighan, and Weinberger, The AWK Programming Language; also "man awk".

The -F option changes the field separator. For example, the fields in /etc/passwd are separated by colons:

awk -F: '
{
    printf "%s has uid %s\n", $1, $3
}' /etc/passwd

'#' introduces a comment, to the end of the line.

You might want to make an awk program file directly executable by starting it with a line like "#!/bin/awk ...", but is that /bin or /usr/bin? It varies, and there is no PATH facility for the "#!" line. So instead use "awk -f file", which means that 'file' contains the awk program, rather than the program being argv[1]. Or just embed the program in your sh script in single quotes. As you will have noticed by now, we're allowed to have newlines within single quotes in sh. The newline is part of the argv member string.
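
As a sketch of the "awk -f file" approach, the following writes the summing program from earlier into a temporary file and then runs it. The use of mktemp and of a here-document to create the file is just my choice for a self-contained illustration; normally the file would simply already exist.

```shell
# Create a temporary file holding the awk program.
prog=$(mktemp)
cat >"$prog" <<'EOF'
# sum the first field of every input line
{
    x += $1
}
END {
    print x
}
EOF

# Run it: the program comes from the file, the data from standard input.
out=$(printf '1\n2\n3\n' | awk -f "$prog")
echo "$out"    # 6

rm -f "$prog"
```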

Passing in shell variables:

if [ "$level" = novice ]
then
    uid='user id number'
else
    uid=uid
fi

awk -F: '
{
    printf "%s has '"$uid"' %s\n", $1, $3
}' /etc/passwd

Note carefully that sequence: '"$uid"'
The surrounding single quotes are actually putting the sequence "$uid" outside the single quotes we're already in. That is, the first single quote ends the single-quoting started on the awk -F: line, and the later single quote resumes single-quoting. This is allowed; so long as there's no unquoted space in between them, we're still putting together the same single argv member string.

This is also why we need to put "$uid" in double-quotes. If that variable's value has a space in it (as it does in the novice branch of the 'if' above), we don't want that space to end one argv string and start another.

This is tricky. I think that you should be able to analyze what's happening as above, but I think that it's fine if you wouldn't be able to come up with this temporarily-stopping-the-single-quotes strategy.
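
One way to analyze it is to have the shell show the string it assembles. In this sketch, the variable name "prog" and the one-line sample input are made up for illustration; the quoting sequence is the same as above.

```shell
uid='user id number'

# Same quoting trick: leave single quotes, expand "$uid", resume single quotes.
prog='{ printf "%s has '"$uid"' %s\n", $1, $3 }'

# This is the single string that awk receives as its program.
printf '%s\n' "$prog"

# Made-up input line in /etc/passwd style.
out=$(echo 'root:x:0:0' | awk -F: "$prog")
echo "$out"    # root has user id number 0
```

By the time awk sees the program, the shell variable's value is just part of awk's format string; awk itself knows nothing about "$uid".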

One further awk topic: output to files.

You can use '>' in print and printf statements. It's a special syntactic character which is part of the print and printf statement syntax.

print "something" >"/tmp/ajr"

Note the use of quotes around the file name, because it is a string. We need quotes around strings in awk as much as in C, unlike in sh.
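
As a sketch, here is the earlier "unknown" program changed so that the lines with no name go to a file while the rest go to the standard output. The file name "unknown.txt" and the temporary directory are made up for illustration.

```shell
# Work in a scratch directory so that the output file does not clutter anything.
dir=$(mktemp -d)

main_out=$(cd "$dir" && printf '3 4 Sally\n1 9\n' | awk '
NF == 2 {
    print "unknown", $1 + $2 >"unknown.txt"
    next
}
{
    print $3, $1 + $2
}')
file_out=$(cat "$dir/unknown.txt")

echo "$main_out"    # Sally 7
echo "$file_out"    # unknown 10

rm -rf "$dir"
```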