Software tools

This text file is presented instead of a video for the second (and main) part of the topic Software Tools.

The "software tools" idea is about writing small, simple programs, which do one thing well; and having powerful and general ways to combine them.

In unix, we can connect programs using a "pipe": a kernel data structure which is a buffer, with the special property that it can be used for i/o redirection.
"command >file" runs "command" but with its standard output redirected into the file "file", as you know.
"cmd1 | cmd2" runs both commands, with the standard output of cmd1 connected into the pipe, and the standard input of cmd2 connected so as to come out of the pipe.
Given how many unix tools are designed as "filters", willing to read from their standard input and write to their standard output, you can make quite a lot of interesting combinations.
"cmd1 | cmd2" is itself a command which can be piped into another command, and so on. You can have a long pipeline, like a factory assembly line, in which each command performs some successive transformation on the data.
Examples:
- given a text file called "composers" listing the names of a number of early music composers, not in order, one per line,
  - "sort composers" yields a sorted file on stdout.
  - "sort -k2 composers" sorts by last name instead of first name
  - "sort -f composers" is a case-insensitive sort
  - you can combine these options by saying "sort -f -k2"
  and
  - "sed" is a "stream editor" — it performs transformations on the data as it goes by, like a filter, without modifying the file on disk
  - the input language to sed is complex and we're not going to look at all of it in this course, just a few of its commands
  - the sed command I'll be using here is "s", for "substitute" — it does a search and replace, but using the unix "regular expression" syntax. Regular expressions are based on the theory of finite automata and are a powerful way to describe sets of strings, such as "all strings containing the letter e" or "all strings containing a letter, five spaces, and then either a digit or the symbol '?'", etc.
  - The 's' command letter is followed by a slash (actually you can use any character as delimiter, but slash is usual... unless the search string has slashes in it), then the search string, then a slash, then the replacement string, then another slash, then any option key letters (we didn't use any of those in the examples in lecture).
  - Altogether we get a command like sed 's/oo/ooo/' to transform double-os into triple-os.
  - A more serious transformation can be described by the complex substitution sed 's/\(.*\) \(.*\)/\2, \1/'
    The backslashed-parentheses indicate grouping. Dot (".") matches any character. Asterisk means zero-or-more of the preceding. So altogether the search string means zero or more of any character (i.e. anything at all), then a space, then zero or more of any character. But the parenthesized groups can be referred to in the replacement string. The replacement string means "the second thing we matched in the search, a comma, a space, then the first thing we matched in the search". Altogether this transforms the line "Alan Rosenthal" into the line "Rosenthal, Alan".
  - Example: sed 's/\(.*\) \(.*\)/\2, \1/' composers
- So we have two useful tools so far: sort and sed.
- We can combine them with a pipe, to transform the list of composers into a list which has last name first and is sorted.
- Example: sort -f -k2 composers | sed 's/\(.*\) \(.*\)/\2, \1/'
- The sort options are affected by which order we do these transformations in; if we've already done the "sed", then we want to sort by the first field (the last name in the sed-transformed file) as opposed to the second field (the last name in the original file).
- Note that all of these "filters" have optional file name arguments: If you supply a list of file names, it reads from those. If you don't supply any file names, it reads its standard input.
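
The point about the order of the transformations can be tried out with a small sketch. The three-line file here is only a stand-in for the real "composers" file, and "mktemp -d" is assumed to be available (it is on most systems, but it is not in POSIX).

```shell
# Stand-in for the "composers" file (contents are an assumption for this sketch).
tmp=$(mktemp -d)
printf '%s\n' 'Henry Purcell' 'Alan Rosenthal' 'Heinrich Schuetz' >"$tmp/composers"

# Sort first (on field 2, the last name), then put the last name first.
sort_then_sed=$(sort -f -k2 "$tmp/composers" | sed 's/\(.*\) \(.*\)/\2, \1/')

# Transform first; now the last name is field 1, so no -k option is needed.
sed_then_sort=$(sed 's/\(.*\) \(.*\)/\2, \1/' "$tmp/composers" | sort -f)

printf '%s\n' "$sort_then_sed"
rm -rf "$tmp"
```

Both pipelines should produce the same three lines, starting with "Purcell, Henry".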

Cute quotation: "Unix is user-friendly; it's just choosy about who its friends are."

Summary: "Do one thing well."

Tools which do just one thing can be combined in arbitrary ways.

One thing a bit odd in unix is that program output doesn't contain headers.

Consider the "who" command. Example output:

ajr      console  Jan  8 06:28
ajr      ttyp1    Jan  8 09:25
ajr      ttyp2    Jan  8 09:26

(The "who" output is more exciting on a system with multiple users, especially if no one's on the console and creating multiple terminal windows.)

We can see how many entries there are by using the "word count" program "wc", with the option "-l" which means "only display the line count":

$ who | wc -l
       3
$

On many non-unix systems we would expect output with a header, identifying the columns, like this:

User     Terminal  Login time
------------------------------
ajr      console  Jan  8 06:28
ajr      ttyp1    Jan  8 09:25
ajr      ttyp2    Jan  8 09:26

But this would cause problems for the software tools model. In the "who | wc -l" case, the line count above would be off by two; in fact we would get funny results from many tools. For example, a "grep" (display only lines matching a search expression) to see who is logged in and has a "-" in their logname would also display the header separation line, or if a user were named "ogi", then "who | grep ogi" would also display the header line.
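
To see the problem concretely, here is a sketch which fakes a "who" with headers. The function name with_header is made up for the illustration; the real "who" does not behave this way.

```shell
# Simulated "who" output with the hypothetical header lines prepended.
with_header() {
    printf '%s\n' 'User     Terminal  Login time' \
                  '------------------------------' \
                  'ajr      console  Jan  8 06:28' \
                  'ajr      ttyp1    Jan  8 09:25' \
                  'ajr      ttyp2    Jan  8 09:26'
}

with_header | wc -l               # counts 5 lines, not the 3 real entries
with_header | grep ogi            # matches only the "Login time" header line
with_header | tail -n +3 | wc -l  # skipping two lines first gives 3 again
```

Every pipeline would need that extra "tail -n +3" step, which is the kind of clutter the header-free convention avoids.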

Command lines

Another background matter about how the shell parses commands has the unfortunate name "globbing". It's the expansion of filename patterns.

For example, '*' matches any number of any character.
"cat comp*" would output the "composers" file because that file name matches that pattern.

To illustrate how the globbing patterns work, we will use the 'echo' command, which just outputs its arguments.

$ echo hello
hello
$

But it outputs those arguments after all expansions and substitutions.

$ echo comp*
composers
$

So below we illustrate how

'*' matches any number of any character (in a very general way, not just the patterns like "*.py" you might be used to)
'?' matches any one character
a list of characters in square brackets matches any one of them
- the list of characters in square brackets can also contain ranges, like "a-z"
- these can be combined, e.g. "[a-xz]" matches any lower-case letter except 'y'
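
The combined bracket pattern can be checked in a scratch directory. This is only a sketch: the four file names are made up, and "mktemp -d" is assumed to exist on your system.

```shell
tmp=$(mktemp -d)
touch "$tmp/a1" "$tmp/ax" "$tmp/ay" "$tmp/az"

# The command substitution runs in a subshell, so the cd does not
# affect the calling shell; the glob is expanded after the cd.
matched=$(cd "$tmp" && echo a[a-xz])
echo "$matched"        # lists ax and az; ay and a1 do not match

rm -rf "$tmp"
```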

To play along, download the file toolsfiles/demo.tar , and extract its contents with the command tar xf demo.tar
(or if you are working on a teach.cs computer, you can just do tar xf /u/csc209h/summer/pub/02/demo.tar )
and then cd to the newly-created demo directory.

We begin with an 'ls' to show the list of file names in the current directory, because the expansion of the glob patterns depends on this:

$ ls
a.pdf		a3.pdf		aw.pdf		gcd.c		newstudentlist
a1.pdf		a4.pdf		composers	grades		people
a12.pdf		abc		firstbyte.c	hello		studentlist
$ echo *.c
firstbyte.c gcd.c
$ echo *1*f
a1.pdf a12.pdf
$ echo a?.pdf
a1.pdf a3.pdf a4.pdf aw.pdf
$ echo a*pdf
a.pdf a1.pdf a12.pdf a3.pdf a4.pdf aw.pdf
$ echo a[0-9].pdf
a1.pdf a3.pdf a4.pdf
$ echo a[w1Q]*pdf
a1.pdf a12.pdf aw.pdf
$

A '.' at the beginning of the file name is treated specially: it is only matched explicitly, not by a '*' or '?'. (There's no special treatment of dots anywhere else in the name.)
"ls" also does not report files whose names begin with a dot, unless you give the "-a" option.

$ echo *c
abc firstbyte.c gcd.c
$ ls -a
.		a1.pdf		abc		gcd.c		people
..		a12.pdf		aw.pdf		grades		studentlist
.abc		a3.pdf		composers	hello
a.pdf		a4.pdf		firstbyte.c	newstudentlist
$ echo .*c
.abc
$ echo .*
. .. .abc
$

Software tools in unix

To make good use of the software tools which are out there, you need to know what they are. Here are a bunch of unix software tools which you want to have in your toolbox.

All of these commands, and all other commands, have man pages. You'll want to get used to reading man pages, especially to find obscure options.

I frequently read man pages. The on-line help in unix is very comprehensive. There's a lot to know and you don't have to remember it all.

If you haven't already done so above, please download the file toolsfiles/demo.tar , and extract its contents with the command tar xf demo.tar
(or if you are working on a teach.cs computer, you can just do tar xf /u/csc209h/summer/pub/02/demo.tar )
and then cd to the newly-created demo directory.
(Or, the individual files are usually linked to below, but it's smoother to have the whole demo directory in advance.)

grep

An example filter is "grep". It outputs lines which match a pattern.

Where '$' is the shell prompt, and given a text file called "composers" (which is in that demo.tar file),

$ grep Q composers
$

(that's all zero of them)

$ grep H composers
Henry Purcell
Hildegard von Bingen
Heinrich Schuetz
$

We can combine this with other commands with a pipeline, as shown in the introductory section of this document:

$ who
ajr      console  Jan  8 06:28
ajr      ttyp1    Jan  8 09:25
ajr      ttyp2    Jan  8 09:26
$ who | grep ajr
ajr      console  Jan  8 06:28
ajr      ttyp1    Jan  8 09:25
ajr      ttyp2    Jan  8 09:26
$ who | grep 09:25
ajr      ttyp1    Jan  8 09:25
$

In the pipelines,

the "who" program's output is redirected into the pipe
the "grep" program's input is redirected from the pipe
both programs run in parallel, although the grep program blocks waiting for input until the who program provides it.
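
One way to convince yourself that the two sides of a pipe really do run at the same time: use a producer which never finishes by itself. If the shell ran the first command to completion before starting the second, the following would hang forever. (The "yes" command is assumed to be installed; it is very common but not required by POSIX.)

```shell
# "yes" prints "y" forever; it only stops because "head" exits after
# reading 3 lines, which closes the reading end of the pipe.
lines=$(yes | head -n 3)
printf '%s\n' "$lines"
```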

Data goes into a command via the standard input, but also via command-line arguments, as in the arguments to grep above.

Find lines which match a regular expression. Examples (some demonstrated in class):

    who | grep ajr
    grep /~ajr/209/ /var/httpd/log/access_log
    lpq | grep ajr | cut -f1 | xargs lprm
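
The patterns above are all plain strings, but grep's pattern is a regular expression, so you can also anchor it to the start ('^') or end ('$') of the line. A small sketch, using just the three names from the "grep H" output above as a stand-in for the full composers file:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Henry Purcell' 'Hildegard von Bingen' 'Heinrich Schuetz' >"$tmp/composers"

starts_he=$(grep '^He' "$tmp/composers")   # lines beginning with "He"
ends_n=$(grep 'n$' "$tmp/composers")       # lines ending with "n"
count_h=$(grep -c H "$tmp/composers")      # -c: just count matching lines

printf '%s\n' "$starts_he"
rm -rf "$tmp"
```

The quotes matter here: '$' and '^' can mean something to the shell, so the pattern is passed through in single quotes.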

tr

Does character-level substitution, e.g. change all 'x's to 'y's.
tr does character ranges, and you can also identify a character by its octal number by writing a backslash followed by three octal digits (you'll need to quote that backslash so that it's not interpreted by the shell but rather passed straight through to tr).
Examples:

    tr o Q
    tr '\015' '\012' <file.mac >file.unix
    tr A-Z a-z
    tr a-zA-Z n-za-mN-ZA-M

(try these! except for the macintosh one I guess)
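
The last tr example is "rot13": each letter is replaced by the one 13 places along, wrapping around. A quick sketch of it, fed from echo (the sample text is arbitrary):

```shell
secret=$(echo 'Hello, World' | tr a-zA-Z n-za-mN-ZA-M)
echo "$secret"

# Applying the same translation a second time gets the original back.
plain=$(echo "$secret" | tr a-zA-Z n-za-mN-ZA-M)
echo "$plain"
```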

head, tail

The first or last n lines of a file (n=10 by default). Examples:

    last | head
    tail /var/log/messages
    tail -40 /var/log/messages
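
If you don't have those log files handy, here is a sketch with generated input. The helper function "numbers" is made up for this example. Note that "tail -n 3" is the POSIX spelling of the older "tail -3" style used above.

```shell
# Generate the numbers 1..12, one per line, as sample input.
numbers() { i=1; while [ "$i" -le 12 ]; do echo "$i"; i=$((i + 1)); done; }

first=$(numbers | head)        # default: the first 10 lines
last3=$(numbers | tail -n 3)   # the last 3 lines

printf '%s\n' "$last3"
```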

sort

Sorts the input. Try this on the "composers" file, or any other.
"-k" specifies which key field number to sort on (field 1, the left of the line, by default).
"-n" does a numeric sort instead of the usual sort (e.g. 12 comes after 2).

    sort
    sort -k2
    sort -n
    sort -n -k3

lots of other options such as case-insensitive, reverse order — see the man page.
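
A sketch of the difference "-n" makes, with made-up input supplied by printf:

```shell
default_order=$(printf '%s\n' 12 2 100 | sort)     # compares character by character
numeric_order=$(printf '%s\n' 12 2 100 | sort -n)  # compares as numbers

# Numeric sort on the third field of each line.
by_third=$(printf '%s\n' 'a x 10' 'b y 9' | sort -n -k3)

echo $default_order
echo $numeric_order
```

In the default sort, "100" comes before "12" because '0' sorts before '2' at the second character.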

uniq

Collapse adjacent identical lines.
Used in the following example which gives you a word-frequency count in a document, also introducing other features of tr:

    tr -cs a-zA-Z0-9 '\012' <file | tr A-Z a-z | sort | uniq -c | sort -n
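
Here is the same pipeline on one line of sample text (the sentence is made up), so you can see what each stage contributes:

```shell
text='The cat and the hat and the bat'

# tr -cs: -c complements the set (everything that is NOT a letter or digit),
#         -s squeezes repeats, so each run of separators becomes one newline.
# tr A-Z a-z: fold to lower case so "The" and "the" count as the same word.
# sort: bring identical words together, since uniq only collapses adjacent lines.
# uniq -c: prefix each distinct word with its count.
# sort -n: order by that count, most frequent last.
counts=$(echo "$text" |
    tr -cs a-zA-Z0-9 '\012' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -n)

printf '%s\n' "$counts"
```

The exact spacing in front of the counts varies between systems, but the last line should show "the" with a count of 3.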

sed

"Stream editor": do edit commands on a file as it goes by in the pipeline. Examples:

    sed s/Fred/Wilma/ people
    sed s/Fred/Wilma/g people
    sed 's/Fred[a-z]*/Wilma/g' people
    sed 5d people
    sed /pattern/d people

sed takes arbitrary regular expressions.
Note that this is not the same syntax as the glob expressions!
For example, "[a-z]*" above means "any number of lower-case letters", whereas in the glob notation it would mean one lower-case letter followed by anything.

The argument to sed often has to be quoted so that special characters in it aren't interpreted by the shell (e.g. as glob notation!).

I wrote a quick intro to unix regular expression syntax.

If you enclose some of the search string in backslashed parentheses, \1 in the replacement means the first such match. If you have multiple pairs of backslashed parentheses, you can also use \2, etc.
Whether some of the search string is enclosed in backslashed parentheses or not, '&' in the replacement string represents the entire text matched by the search string.

Examples (try them!):

sed 's/[A-Z]/ capital-& /g' composers
sed 's/\(.*\) \(.*\)/\2, \1/' composers
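
If you want to see these on single lines rather than the whole file, here is a sketch using echo to supply the input:

```shell
swapped=$(echo 'Alan Rosenthal' | sed 's/\(.*\) \(.*\)/\2, \1/')

# With three words, the first .* is "greedy" and takes as much as it can,
# so the split happens at the LAST space on the line.
three=$(echo 'Hildegard von Bingen' | sed 's/\(.*\) \(.*\)/\2, \1/')

# '&' stands for whatever the pattern matched (here, one capital letter).
marked=$(echo 'Fred' | sed 's/[A-Z]/ capital-& /g')

printf '%s\n' "$swapped" "$three" "$marked"
```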

echo

Provides output.
-n means not to output the terminating newline.

    echo Please enter repeat count:
    echo -n 'Please enter repeat count: '

→ note how it takes any number of arguments, and outputs them separated by spaces.

Use "tr" to convert x's to y's in xylophone:

can't say tr x y xylophone (even if it took files — xylophone is the data, not a file)
so: echo xylophone | tr x y

cat

lots of interesting options, such as -n to number the lines, -s to eliminate multiple blank lines.

ls

"ls dir" or "ls file"
ls -d to avoid descending into a directory
use xargs to make it read stdin in any interesting way
check out -a, -l, -i, -q, -t, -r
ls strangely (and unsimply) acts differently by default based on whether its output is a "tty" or not (compare "ls" and "ls | cat"), but there are options -C to force columnar output and -1 to force one file per line (mnemonic: "one column")

In general in unix tools, you can combine options into one word with just one minus sign. For example, instead of writing "ls -l -a -r -t" you can write "ls -lart".
Although as soon as you hit an option which takes an argument (such as sort's -k option), that's it for that word. So for example, in "sort -k2f", all of "2f" would be the argument to -k. Although you could rearrange this one: "sort -fk2" is still the same as "sort -f -k2". (This asymmetry is caused by the fact that -k takes an argument and -f does not.)
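
A quick check that the combined and separate spellings agree, as a sketch with a made-up file of names (mixed case on purpose, so that the case-insensitive flag has something to do):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'henry Purcell' 'Alan rosenthal' 'Heinrich Schuetz' >"$tmp/names"

separate=$(sort -f -k2 "$tmp/names")   # two option words
combined=$(sort -fk2 "$tmp/names")     # one word: -f, then -k with argument "2"

printf '%s\n' "$combined"
rm -rf "$tmp"
```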

cp

copy files
either 2 arguments, or multiple arguments and a directory; check out -p and -r

mv

move (rename)
similar command format to cp; always preserves time

rm

remove (delete)
also see options -r, -f
(Don't get in the habit of using -f unnecessarily: you lose valuable error messages)

cmp

compare files in a byte-oriented way (especially useful for non-text files).
also try cmp -l

diff

compare files in a text-file-oriented way, cleverly finding matching bits so as to show only the differences.
also try diff -b, also -c

diff is the basis of commands to compare different revisions in many source control systems, e.g. "git diff".
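
To compare cmp and diff side by side, here is a sketch with two made-up three-line files which differ in their second line. Both commands also report through their exit status: 0 for "same", 1 for "different".

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' >"$tmp/old"
printf 'one\n2\nthree\n'   >"$tmp/new"

# cmp stops at the first differing byte and says where it is.
cmp "$tmp/old" "$tmp/new"; cmp_status=$?

# diff shows which lines would have to change to turn old into new.
changes=$(diff "$tmp/old" "$tmp/new"); diff_status=$?
printf '%s\n' "$changes"

rm -rf "$tmp"
```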

comm

Show lines in common between two files.

Example: students enrolled in CSC 209 before and after the drop date (fictional)
Please try these commands:

comm -1 studentlist newstudentlist
comm -12 studentlist newstudentlist
comm -13 studentlist newstudentlist
comm -23 studentlist newstudentlist

The rule is that the command-line options say which of the three columns to suppress. It's a little odd. Compare with "comm studentlist newstudentlist" with no options, which produces unreadable and useless output but is the key to understanding how the options work.
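
The same idea on two tiny made-up lists (not the real studentlist files), as a sketch. Column 1 is "only in the first file", column 2 is "only in the second file", and column 3 is "in both". Both inputs have to be sorted.

```shell
tmp=$(mktemp -d)
printf '%s\n' alice bob carol  >"$tmp/before"
printf '%s\n' alice carol dave >"$tmp/after"

stayed=$(comm -12 "$tmp/before" "$tmp/after")   # suppress 1 and 2: lines in both
dropped=$(comm -23 "$tmp/before" "$tmp/after")  # suppress 2 and 3: only in "before"
added=$(comm -13 "$tmp/before" "$tmp/after")    # suppress 1 and 3: only in "after"

echo "stayed:" $stayed
echo "dropped:" $dropped
echo "added:" $added
rm -rf "$tmp"
```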

join

Does a database join of two files. The files must be sorted by the key field you're joining on.
Try:

join newstudentlist grades

There are millions of options to specify what the key fields are, what the output format should be, etc. I think that most people consult the man page every single time they write a join command (which is used more frequently in a shell script than interactively).
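
A minimal sketch of the default behaviour, with two made-up files (names and numbers are invented): join matches lines on the first field, and lines with no partner in the other file are left out.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'alice 1001' 'carol 1003'        >"$tmp/students"
printf '%s\n' 'alice 85' 'bob 70' 'carol 92'   >"$tmp/marks"

# Both files are already sorted on field 1, the join key.
joined=$(join "$tmp/students" "$tmp/marks")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

There is no "bob" line in the output because bob appears in only one of the files.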

The idea of the "-" file name

Many commands are willing to accept "-" instead of a file name, and read the standard input at that point.

For example, diff - file will compare the standard input to the contents of "file".
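
Two things to try, as a sketch with a made-up two-line file:

```shell
tmp=$(mktemp -d)
printf 'b\na\n' >"$tmp/file"

# Compare the sorted version (arriving on stdin) with the file itself.
# They differ, so diff prints the differences and exits with status 1.
sort "$tmp/file" | diff - "$tmp/file"; status=$?

# cat can splice its standard input in between named files.
spliced=$(echo middle | cat "$tmp/file" - "$tmp/file")

printf '%s\n' "$spliced"
rm -rf "$tmp"
```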

Summary

Software tools are small programs which do one thing well.

Find

find is a program you should also know about. It traverses directory hierarchies. It has a huge number of options. I'm not sure it quite qualifies as a "small program which does one thing well", but it's the right tool for a certain set of tasks.

The command line begins with a list of directories, then contains a list of predicates, usually ending with either "-print" (to print the path name if you get that far, i.e. all of the previous predicates are true) or "-exec" (to execute a command for that file path name).

For exec, in the command you can use "{}" to mean to substitute the path name here in the command-line. Since these characters are special to the shell, they need to be quoted.

The command-line for exec needs to be terminated, else find wouldn't be able to tell where the command ends and the find options resume. It is terminated by a semi-colon (as a separate argument). Since semicolon is special to the shell, it needs to be quoted.

The few examples below are intended to give you an idea of what find can do, not to teach you to use it; you'll learn how to use find as you have particular applications for it.

Basic command to find a file by name in a directory tree:

find /u/ajr/209/web/notes -name cat0.c -print

Sample output:

/u/ajr/209/web/notes/toolsfiles/cat0.c

Find files which are modified within the last 30 days, and execute the "ls -l" command on them. But they might be directories, so we need the "-d" option to ls as well.

find /u/ajr/209/web/notes -mtime -30 -exec ls -ld '{}' ';'

Sample output:

drwxr-xr-x  43 ajr  staff  1462 Feb 17 01:02 /u/ajr/209/web/notes
-rw-r--r--  1 ajr  staff  6964 Feb 17 01:01 /u/ajr/209/web/notes/c
-rw-r--r--  1 ajr  staff  7372 Feb 17 01:02 /u/ajr/209/web/notes/files
drwxr-xr-x  8 ajr  staff  272 Feb 19 13:44 /u/ajr/209/web/notes/sockets
-rw-r--r--  1 ajr  staff  1388 Feb 19 13:44 /u/ajr/209/web/notes/sockets/client.c
-rw-r--r--  1 ajr  staff  1329 Feb 19 13:43 /u/ajr/209/web/notes/sockets/client_inet.c
-rw-r--r--  1 ajr  staff  2830 Feb 19 13:35 /u/ajr/209/web/notes/sockets/server.c
-rw-r--r--  1 ajr  staff  1758 Feb 19 13:37 /u/ajr/209/web/notes/sockets/server_inet.c

As above, but only finding plain files (e.g. excluding directories) (so the "-d" option to ls is no longer important, but doesn't hurt either):

find /u/ajr/209/web/notes -type f -mtime -30 -exec ls -ld '{}' ';'

Sample output:

-rw-r--r--  1 ajr  staff  6964 Feb 17 01:01 /u/ajr/209/web/notes/c
-rw-r--r--  1 ajr  staff  7372 Feb 17 01:02 /u/ajr/209/web/notes/files
-rw-r--r--  1 ajr  staff  1388 Feb 19 13:44 /u/ajr/209/web/notes/sockets/client.c
-rw-r--r--  1 ajr  staff  1329 Feb 19 13:43 /u/ajr/209/web/notes/sockets/client_inet.c
-rw-r--r--  1 ajr  staff  2830 Feb 19 13:35 /u/ajr/209/web/notes/sockets/server.c
-rw-r--r--  1 ajr  staff  1758 Feb 19 13:37 /u/ajr/209/web/notes/sockets/server_inet.c
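
Since you don't have that directory tree, here is a sketch which builds a tiny one in a scratch directory and tries the same predicates on it. The directory layout is made up, loosely modelled on the sample output above.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/sockets"
touch "$tmp/notes.txt" "$tmp/sockets/client.c" "$tmp/sockets/server.c"

# Find a file by name.
by_name=$(find "$tmp" -name server.c -print)

# Plain files matching a pattern; the pattern is quoted so that find,
# not the shell, does the matching. Run "basename" on each one.
c_files=$(find "$tmp" -type f -name '*.c' -exec basename '{}' ';' | sort)

# Directories modified within the last 30 days (both were just created).
recent_dirs=$(find "$tmp" -type d -mtime -30 -print | wc -l | tr -d ' ')

printf '%s\n' "$c_files"
rm -rf "$tmp"
```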

Some I/O redirection details (to be discussed in class)

gcc gcd.c
./a.out >file
sh testgcd >file
- i/o redirection is "inherited" when the sh process runs the other processes
echo hello >hi
echo foo; echo bar; echo baz >fbb
- Only "echo baz" is redirected.
- It's an operator precedence issue (precedence of ";" versus ">").
- The shell implements parentheses to change the precedence!
(echo foo; echo bar; echo baz) >fbb
All of this also applies to pipes and pipelines, e.g. you can have a parenthesized command sequence as one element in a pipeline.
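
You can check the precedence point for yourself with a sketch like this one (the output file names are made up, and the files live in a scratch directory):

```shell
tmp=$(mktemp -d)

# Without parentheses, only the last echo is redirected;
# "foo" and "bar" go to the normal standard output.
echo foo; echo bar; echo baz >"$tmp/last_only"

# With parentheses, the whole sequence shares one redirection.
(echo foo; echo bar; echo baz) >"$tmp/all_three"

# A parenthesized sequence can also be one element of a pipeline.
(echo foo; echo bar; echo baz) | tr a-z A-Z >"$tmp/upper"

last_only=$(cat "$tmp/last_only")
all_three=$(cat "$tmp/all_three")
upper=$(cat "$tmp/upper")
rm -rf "$tmp"
```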