AWK¶
Learning outcomes
- Learners can use awk
- Learners have practiced using a book on AWK
- Learners can use awk in pipes
- Learners can use awk to read a specific column
- Learners can use awk to transform text
- Learners can use regular expressions in awk
- Learners have practiced reading bash commands
For teachers
Lesson plan:
Time | Minutes | Duration | Description |
---|---|---|---|
11:20-11:25 | 0-5 | 5 | Prior |
11:25-11:30 | 5-10 | 5 | Present |
11:30-11:50 | 10-30 | 20 | Challenge |
11:50-12:00 | 30-40 | 10 | Feedback |
Prior:
- Imagine a file that contains plain-text tabular data. How would you work with it?
- What is a Turing-complete programming language?
- What is AWK (in upper case)?
- What is awk (in lower case)?
Why is AWK important?¶
AWK is a programming language for text processing that is included with most Linux distributions. As a Turing-complete programming language, it can, by definition, solve any computable problem, given enough time and memory.
The different spellings
Spelling | Description |
---|---|
AWK | The programming language |
awk | The program |
Awk | A common misspelling |
Exercises¶
In these exercises, we’ll be using the Bash Guide for Beginners, because this free online book fits this course well and allows you to continue studying after this course.
Exercise 1: printing selected fields¶
Read the text at chapter 6.2.1: ‘Printing selected fields’.
Exercise 1.1: understanding the first part¶
The single line of code in this subsection uses a pipe. Run the command until the pipe. What do you see? How do you explain in English what this does?
Answer
This is the full command shown in this subsection:
ls -l | awk '{ print $5 $9 }'
The command before the pipe is:
ls -l
When running it, you will see something similar to:
$ ls -l
total 4
drwxrwxr-x 2 sven sven 4096 Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4096 Jan 8 20:05 Desktop
drwxr-xr-x 10 sven sven 4096 Feb 27 09:44 Documents
drwxr-xr-x 3 sven sven 4096 May 28 08:51 Downloads
Searching the manual of ls, using man ls, gives us the following description of the -l flag: ‘use a long listing format’.
Hence, ls -l lists files in the current folder using a long listing format.
Exercise 1.2: understanding the awk part¶
The single line of code in this subsection forwards its output (from ls) to awk. Run it. What do you see? How do you explain in English what this does?
Answer
The command to run is:
ls -l | awk '{ print $5 $9 }'
When running it, you will see something similar to:
In English: from a list of files (in long format), show the fifth and ninth columns.
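To try this on fixed input instead of your own folder, here is a minimal sketch. It assumes the first lines of the listing shown in exercise 1.1 as sample text; the variable names listing and result are only for illustration.

```shell
# Sample input: the first lines of the 'ls -l' output shown in exercise 1.1.
listing='total 4
drwxrwxr-x 2 sven sven 4096 Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4096 Jan 8 20:05 Desktop'

# Print the fifth and ninth field of every line.
# There is no comma between $5 and $9, so awk glues them together.
result=$(printf '%s\n' "$listing" | awk '{ print $5 $9 }')
printf '%s\n' "$result"
```

The first output line is empty, because the line total 4 has no fifth or ninth field. The other two lines read 4096bin and 4096Desktop.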
Exercise 1.3: how awk deals with missing columns¶
The command shows the fifth and ninth columns of a list of files (in long format). How does awk deal with lines that do not have a fifth and/or ninth column?
Answer
When running the command, we have already seen the empty first line of output:
Hence, if there is no fifth and/or ninth column to display, awk shows an empty line.
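The sketch below makes the empty fields visible by putting brackets around them. It also shows one possible way (not taken from the book) to skip lines that are too short, using the built-in variable NF, which holds the number of fields in the current line. The sample input and variable names are assumptions for illustration.

```shell
# Two sample lines: the first has only 2 fields, the second has 9.
input='total 4
drwxrwxr-x 2 sven sven 4096 Jun 10 2024 bin'

# A missing field is an empty string, so the first line prints as [][].
bracketed=$(printf '%s\n' "$input" | awk '{ print "[" $5 "][" $9 "]" }')
printf '%s\n' "$bracketed"

# One possible way to skip short lines: only act when there are 9 or more fields.
filtered=$(printf '%s\n' "$input" | awk 'NF >= 9 { print $5 $9 }')
printf '%s\n' "$filtered"
```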
(optional) Exercise 1.4: awk versus cut¶
Try to use cut (and only cut!) to achieve the same, by selecting the same columns. This will not work! Observe and explain what you see.
Answer
To get the most similar output, we need to use columns 6 and 10:
This is not because we actually wanted to use columns 6 and 10: cut treats every single space as a delimiter, so the multiple and varying number of spaces in each line throws the counting off.
If you know the tr command, you can remove duplicate spaces:
$ ls -l | tr --squeeze-repeats " " | cut --delimiter " " --fields 5,9
4096 bin
4096 Desktop
4096 Documents
4096 Downloads
The output does put a space between the columns; we can remove that too:
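One possible way to do so, as a sketch, is a second tr that deletes the remaining spaces. The sample lines are hypothetical ls -l output with uneven spacing, and the short options are used instead of the long ones shown above.

```shell
# Hypothetical 'ls -l' lines; note the varying number of spaces between columns.
listing='drwxrwxr-x  2 sven sven 4096 Jun 10  2024 bin
drwxr-xr-x 10 sven sven 4096 Feb 27 09:44 Documents'

# tr -s ' '          : squeeze repeated spaces (short for --squeeze-repeats)
# cut -d ' ' -f 5,9  : select fields 5 and 9 (short for --delimiter and --fields)
# tr -d ' '          : delete the space that cut leaves between the two fields
result=$(printf '%s\n' "$listing" | tr -s ' ' | cut -d ' ' -f 5,9 | tr -d ' ')
printf '%s\n' "$result"
```

Note that the final tr would also delete spaces inside a file name, so this only mimics the awk output for names without spaces.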
Exercise 2: formatting fields¶
Read the text at chapter 6.2.2: ‘Formatting fields’
Exercise 2.1: understanding the first part¶
The first code example in this subsection uses multiple pipes. Run the command until the first pipe. What do you see? How do you explain in English what this does? Use the ls manual.
Answer
The first code example in this subsection is:
ls -ldh * | grep -v total | awk '{ print "Size is " $5 " bytes for " $9 }'
The command until the first pipe is:
ls -ldh *
Running it shows something similar to this:
$ ls -ldh *
drwxrwxr-x 2 sven sven 4.0K Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4.0K Jan 8 20:05 Desktop
drwxr-xr-x 10 sven sven 4.0K Feb 27 09:44 Documents
drwxr-xr-x 3 sven sven 4.0K May 28 08:51 Downloads
Using the manual of ls:
In English: the command shows the list of files and directories (ls) …
- in a long format (-l)
- with directories as themselves (-d, also --directory, i.e. not their contents)
- in a human-readable format (-h, also --human-readable).
Exercise 2.2: understanding the second part¶
The first code example in this subsection uses multiple pipes. Run the command until the second pipe. What do you see? How do you explain in English what this does?
Answer
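The first code example in this subsection is:
ls -ldh * | grep -v total | awk '{ print "Size is " $5 " bytes for " $9 }'
The command until the second pipe is:
ls -ldh * | grep -v total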
Running it shows something similar to this:
$ ls -ldh * | grep -v total
drwxrwxr-x 2 sven sven 4.0K Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4.0K Jan 8 20:05 Desktop
drwxr-xr-x 10 sven sven 4.0K Feb 27 09:44 Documents
drwxr-xr-x 3 sven sven 4.0K May 28 08:51 Downloads
Note that, for the computer used, there is no difference.
In English: the command shows the list of files for lines that have no match (-v, also --invert-match) to the regular expression total.
Or shorter: it shows the listing, excluding a possible line that shows the total size (such as the total 4 line seen in exercise 1.1).
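Because total is a regular expression that is matched anywhere in the line, it can remove more than intended. The sketch below uses a hypothetical listing with a file called totals.txt to show this, together with a stricter pattern as one possible alternative (not taken from the book).

```shell
# Hypothetical listing: a 'total' line, a folder, and a file named totals.txt.
listing='total 8
drwxrwxr-x 2 sven sven 4.0K Jun 10 2024 bin
-rw-r--r-- 1 sven sven 12 Jan 8 20:05 totals.txt'

# Keep every line that does NOT match the regular expression 'total'.
# This also drops totals.txt, because its name contains 'total'.
loose=$(printf '%s\n' "$listing" | grep -v total)
printf '%s\n' "$loose"

# A stricter variant: only drop lines that start with 'total '.
strict=$(printf '%s\n' "$listing" | grep -v '^total ')
printf '%s\n' "$strict"
```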
Exercise 2.3: understanding the awk part¶
The first code example in this subsection uses multiple pipes. Run the command in full. What do you see?
Answer
The first code example in this subsection is:
ls -ldh * | grep -v total | awk '{ print "Size is " $5 " bytes for " $9 }'
Running it shows something similar to this:
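As a sketch on fixed input, the full pipeline can be tried without depending on your own folder. The sample lines are taken from the ls -ldh * output shown in exercise 2.1; the variable names are only for illustration.

```shell
# Sample input: two lines of the 'ls -ldh *' output shown in exercise 2.1.
listing='drwxrwxr-x 2 sven sven 4.0K Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4.0K Jan 8 20:05 Desktop'

# Text between double quotes is printed literally;
# $5 and $9 are replaced by the fifth and ninth field.
result=$(printf '%s\n' "$listing" | grep -v total |
  awk '{ print "Size is " $5 " bytes for " $9 }')
printf '%s\n' "$result"
```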
Exercise 2.4: understanding the single quote¶
A first thing to notice is that the awk command is put between single quotes ' (instead of double quotes, "). Why is that?
Answer
Text between single quotes is used as-is:
The other way around, between double quotes, the ! triggers something:
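For awk programs, the dollar sign is the character where the choice of quotes matters most. The sketch below is an additional illustration, not the example from the original lesson: it assumes a shell whose own fifth positional parameter is empty, which set -- enforces.

```shell
# Clear the shell's positional parameters, so the shell's own $5 is empty.
set --

# Single quotes: the shell passes $5 to awk untouched; awk prints field 5.
single=$(echo 'a b c d e' | awk '{ print $5 }')

# Double quotes: the shell replaces $5 (empty here) before awk starts,
# so awk receives the program { print  } and prints the whole line.
double=$(echo 'a b c d e' | awk "{ print $5 }")

printf '%s\n' "$single" "$double"
```

With single quotes the result is e; with double quotes it is the full line a b c d e.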
(Optional) Exercise 2.5: using commas between elements to print¶
Zooming in on the printing of awk, i.e. the part print "Size is " $5 " bytes for " $9, we can see that the elements to be printed are separated by a space. This is unlike most (perhaps all) modern languages, where elements are separated by a comma. Rewrite the expression to use a comma between the elements.
Answer
The first attempt would be:
This, however, gives double spaces now:
$ ls -ldh * | grep -v total | awk '{ print "Size is ", $5, " bytes for ", $9 }'
Size is 4.0K bytes for bin
Size is 4.0K bytes for Desktop
Size is 4.0K bytes for Documents
Size is 4.0K bytes for Downloads
Removing the spaces between the double-quotes (") solves this:
The output will look similar to this:
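A sketch of this on fixed sample input is shown below. It relies on the fact that a comma in print inserts awk's output field separator, which is a single space by default. The sample lines come from exercise 2.1; the variable names are only for illustration.

```shell
# Sample input: two lines of the 'ls -ldh *' output shown in exercise 2.1.
listing='drwxrwxr-x 2 sven sven 4.0K Jun 10 2024 bin
drwxr-xr-x 2 sven sven 4.0K Jan 8 20:05 Desktop'

# Each comma puts one space between the elements,
# so the quoted texts no longer need their own spaces.
result=$(printf '%s\n' "$listing" |
  awk '{ print "Size is", $5, "bytes for", $9 }')
printf '%s\n' "$result"
```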
Exercise 3: regular expressions¶
Read the text at chapter 6.2.3: ‘The print command and regular expressions’
Exercise 3.1: understanding the first part¶
The first code example in this subsection uses a pipe. Run the command until the pipe. What do you see? How do you explain in English what this does? Use the df manual.
Answer
The first code example in this subsection is:
df -h | awk '/dev\/hd/ { print $6 "\t: " $5 }'
The command until the pipe is:
df -h
Running it shows something similar to this:
$ df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 1.6G 2.9M 1.6G 1% /run
/dev/nvme0n1p2 468G 226G 219G 51% /
tmpfs 7.6G 29M 7.6G 1% /dev/shm
tmpfs 5.0M 8.0K 5.0M 1% /run/lock
efivarfs 438K 293K 141K 68% /sys/firmware/efi/efivars
/dev/nvme0n1p1 511M 73M 439M 15% /boot/efi
tmpfs 1.6G 148K 1.6G 1% /run/user/1000
Using the manual of df:
In English: show the file system space usage (df) in a human-readable format (-h, also --human-readable).
Exercise 3.2: understanding the awk part¶
Run the command in full. What do you see?
Tip: if you see nothing, use df -h | awk '/dev\// { print $6 "\t: " $5 }'
Answer
The first code example in this subsection is:
df -h | awk '/dev\/hd/ { print $6 "\t: " $5 }'
Running it shows something similar to this:
On the computer used, this shows no output.
Running the alternative:
It shows the mount point and the percentage of disk space in use for all lines that contain dev/, which includes devices such as /dev/nvme0n1p2 but also file systems mounted under /dev.
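Both commands can be compared on fixed input. The sketch below assumes part of the df -h output shown in exercise 3.1 as sample text; the variable names are only for illustration.

```shell
# Sample input: part of the 'df -h' output shown in exercise 3.1.
usage='Filesystem Size Used Avail Use% Mounted on
tmpfs 1.6G 2.9M 1.6G 1% /run
/dev/nvme0n1p2 468G 226G 219G 51% /
tmpfs 7.6G 29M 7.6G 1% /dev/shm
/dev/nvme0n1p1 511M 73M 439M 15% /boot/efi'

# No line contains 'dev/hd', so this prints nothing.
hd=$(printf '%s\n' "$usage" | awk '/dev\/hd/ { print $6 "\t: " $5 }')

# Three lines contain 'dev/': two devices and the tmpfs mounted on /dev/shm.
dev=$(printf '%s\n' "$usage" | awk '/dev\// { print $6 "\t: " $5 }')
printf '%s\n' "$dev"
```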
Exercise 3.3: understanding the regular expression¶
This awk command uses a regular expression. What is it exactly? If it is formatted ‘weirdly’, why is that?
Answer
The first code example in this subsection is:
df -h | awk '/dev\/hd/ { print $6 "\t: " $5 }'
The exact regular expression is the exact text between the (unescaped) slashes:
dev\/hd
This is formatted ‘weirdly’, as it uses \/ instead of just /. This is because awk uses / to indicate the start and end of a regular expression. Hence, for the same character to be part of that regular expression, it is escaped using a backslash.
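As an aside that is not in the book: the escape is only needed because of the slash delimiters. The sketch below compares the escaped form with a match against a string using the ~ operator, on one sample line from exercise 3.1.

```shell
# One sample line of the 'df -h' output shown in exercise 3.1.
line='/dev/nvme0n1p2 468G 226G 219G 51% /'

# Between slashes, a literal / must be escaped as \/.
escaped=$(printf '%s\n' "$line" | awk '/dev\// { print $6 }')

# Matching against a string with the ~ operator needs no escape,
# because the regular expression is no longer delimited by slashes.
unescaped=$(printf '%s\n' "$line" | awk '$0 ~ "dev/" { print $6 }')

printf '%s\n' "$escaped" "$unescaped"
```

Both variants print the mount point / for this line.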
Exercise 3.4: understanding the \t¶
In the awk command, there is a \t in the printing part. What does it do, and why is it written like that?
Answer
The first code example in this subsection is:
df -h | awk '/dev\/hd/ { print $6 "\t: " $5 }'
The \t prints a tab. It is written like that because \t is the conventional way to write a tab, similar to the convention that \n is a newline.
(optional) Exercise 4: can awk do …?¶
AWK is a Turing-complete language, hence the answer to ‘Can AWK do …?’ (applied to text) is always yes.
Below are some questions you may have and how to solve them in awk. Pick the topics you are interested in.
(optional) Exercise 4.1: Can awk display all columns?
Can awk display all columns? Or: upon a match, can awk display the whole line?
The answer is: yes!
Read the text at subsection 6.2.1: ‘Printing selected fields’.
Use a pipe to direct the output of ls -l to awk, where the whole line is printed.
Answer
The symbol $0 is used for ‘all columns’/‘the whole line’:
On its own, this program is not useful: it just echoes its input.
$0 becomes useful when used with other awk features, such as matching lines for a regular expression:
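Both uses of $0 are sketched below on fixed input. The sample lines are taken from the listing in exercise 1.1; the regular expression Do and the variable names are assumptions chosen for illustration.

```shell
# Sample input: three lines of the 'ls -l' output shown in exercise 1.1.
listing='drwxrwxr-x 2 sven sven 4096 Jun 10 2024 bin
drwxr-xr-x 10 sven sven 4096 Feb 27 09:44 Documents
drwxr-xr-x 3 sven sven 4096 May 28 08:51 Downloads'

# Echo every line unchanged.
all=$(printf '%s\n' "$listing" | awk '{ print $0 }')

# Print the whole line only when it matches the regular expression Do.
matches=$(printf '%s\n' "$listing" | awk '/Do/ { print $0 }')
printf '%s\n' "$matches"
```

Only the lines for Documents and Downloads match.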
(optional) Exercise 4.2: Can awk display the line number?
Can awk display the line number?
Read the text at chapter 6.3.3: ‘The number of records’.
The answer is: yes!
Use a pipe to direct the output of ls -l to awk, where the line number and the values in the first column are printed.
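To see the NR variable at work before applying it to ls -l, here is a minimal sketch on a hypothetical three-line input. The input text and variable names are made up for illustration.

```shell
# Hypothetical three-line input.
people='alice 30 Uppsala
bob 25 Lund
carol 41'

# NR holds the number of the current line (record), starting at 1.
result=$(printf '%s\n' "$people" | awk '{ print NR, $1 }')
printf '%s\n' "$result"
```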
(optional) Exercise 4.3: Can awk display the number of columns?
Can awk display the number of columns?
The answer is: yes!
To do so, print the variable NF, as shown in the program below:
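A minimal sketch of such a program, run on a hypothetical three-line input (the input text and variable names are made up for illustration):

```shell
# Hypothetical input: two lines with 3 fields, one line with 2 fields.
people='alice 30 Uppsala
bob 25 Lund
carol 41'

# NF holds the number of fields in the current line.
result=$(printf '%s\n' "$people" | awk '{ print NF }')
printf '%s\n' "$result"
```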
Use a pipe to direct the output of ls -l to awk, where the number of columns is printed.
(optional) Exercise 4.4: Can awk display the last column?
Can awk display the last column?
The answer is: yes!
To do so, print the variable $NF, as shown in the program below:
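A minimal sketch of such a program, again on a hypothetical input with made-up names. Because NF is the number of fields, $NF is the field with that number, i.e. the last one.

```shell
# Hypothetical input: the last field is the third or the second one.
people='alice 30 Uppsala
bob 25 Lund
carol 41'

# Print the first field and the last field of every line.
result=$(printf '%s\n' "$people" | awk '{ print $1, $NF }')
printf '%s\n' "$result"
```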
Use a pipe to direct the output of ls -l to awk, where the values in the first and last column are printed.
(optional) Exercise 4.5: Can awk count the number of lines?
Can awk count the number of lines?
The answer is: yes!
Read the text at chapter 6.3.3: ‘The number of records’.
Use a pipe to direct the output of ls -l to awk, where the number of lines is printed.
Answer
A good first guess, but incorrect, is to use the command below, which is good for numbering lines:
The last number is indeed the number of lines.
The AWK way to solve it is to use the END clause, which is only run at the end:
NR only becomes useful when used with other awk features, such as printing a descriptive text around it:
There are many ways to print the number of lines, such as combining the incorrect awk way with tail:
Clumsy, but it works.
Alternatively, wc is made exactly for the purpose of counting lines:
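The variants discussed in this answer can be compared side by side. The sketch below uses a hypothetical three-line input instead of ls -l, and the descriptive text Lines: is an assumption; the original commands may have differed in detail.

```shell
# Hypothetical three-line input.
input='one
two
three'

# Numbers every line; only the last number is the line count.
numbered=$(printf '%s\n' "$input" | awk '{ print NR }')

# The END clause runs once, after the last line, when NR is the total.
count=$(printf '%s\n' "$input" | awk 'END { print NR }')

# The same, with a descriptive text around the number.
described=$(printf '%s\n' "$input" | awk 'END { print "Lines: " NR }')

# The clumsy variant: number all lines, keep only the last one.
via_tail=$(printf '%s\n' "$input" | awk '{ print NR }' | tail -n 1)

# wc -l counts the lines directly.
via_wc=$(printf '%s\n' "$input" | wc -l)

printf '%s\n' "$count" "$described" "$via_tail" "$via_wc"
```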
(optional) Exercise 4.7: Can awk work on comma-separated files?
Can awk work on comma-separated files?
The answer is: yes!
Read the text at chapter 6.3.1: ‘The input field separator’.
Here we convert the output of ls -l to its comma-separated equivalent:
Using this input, use a pipe to show the fifth and ninth column.
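As a minimal sketch of the field separator, the example below runs awk on a hypothetical comma-separated input (not the converted ls -l output) and prints other columns than the exercise asks for.

```shell
# Hypothetical comma-separated input.
csv='alice,30,Uppsala
bob,25,Lund'

# -F sets the input field separator, here to a comma.
result=$(printf '%s\n' "$csv" | awk -F ',' '{ print $1, $3 }')
printf '%s\n' "$result"
```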
(optional) Exercise 4.8: Can awk show something once at the start?
Can awk show something once at the start?
The answer is: yes!
Read the text at chapter 6.2.4: ‘Special patterns’.
Use a pipe to direct the output of ls -l to awk, where the text Permissions: is shown, after which the values in the first column are shown.
If the word total shows up in your results, you can ignore it for this exercise.
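The BEGIN pattern can be sketched on a hypothetical input as follows. The header text Names: and the variable names are made up for illustration.

```shell
# Hypothetical three-line input.
people='alice 30 Uppsala
bob 25 Lund
carol 41'

# The BEGIN clause runs once, before the first line is read.
# The second clause has no pattern, so it runs for every line.
result=$(printf '%s\n' "$people" | awk 'BEGIN { print "Names:" } { print $1 }')
printf '%s\n' "$result"
```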
(optional) Exercise 4.9: Can awk show something once at the end?
Can awk show something once at the end?
The answer is: yes!
Read the text at chapter 6.2.4: ‘Special patterns’.
Use a pipe to direct the output of ls -l to awk, where the values in the first column are shown.
At the end of the output, it should show the text Done!
If the word total shows up in your results, you can ignore it for this exercise.
(optional) Exercise 4.10: Can awk use variables?
Can awk use variables?
The answer is: yes!
Read the text at chapter 6.3.4: ‘User defined variables’.
Use a pipe to direct the output of ls -l to awk. Sum the values of the fifth column and show the result.
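A user-defined variable can be sketched on a hypothetical input as follows. The variable name sum and the input text are assumptions for illustration; the example sums the second column, not the fifth.

```shell
# Hypothetical three-line input with a number in the second column.
people='alice 30 Uppsala
bob 25 Lund
carol 41'

# sum starts out empty (treated as 0) and grows with every line;
# the END clause prints it once, after the last line.
result=$(printf '%s\n' "$people" | awk '{ sum += $2 } END { print sum }')
printf '%s\n' "$result"
```

For this input, the sum is 30 + 25 + 41 = 96.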
For teachers
What is the difference between AWK and awk?
Answer
AWK is the name of the programming language.
awk is the name of the program that can run AWK.
What can AWK not do?
Answer
AWK, like any Turing-complete language, can solve any computable problem, but cannot do this:
- run computations at any speed (i.e. a problem may take billions of years to complete)
- use any amount of memory (i.e. a problem may require billions of gigabytes to solve)
When not to use AWK?
Answer
AWK shines at problems of intermediate complexity.
For simple problems, tools such as grep, cut and wc are just as good.
For harder problems, use a modern programming language instead, as these have libraries/packages that can, for example, read or analyse an entire table at once.
Conclusions¶
- awk can be used in pipes: ls -l | awk '{ print $5 $9 }'
- awk can be used to read a specific column: ls -l | awk '{ print $5 $9 }'
- awk can be used to transform text: ls -ldh * | grep -v total | awk '{ print "Size is " $5 " bytes for " $9 }'
- awk can use regular expressions: df -h | awk '/dev\/hd/ { print $6 "\t: " $5 }'
- awk can do a lot more
Learning AWK¶
Learning resource | Description |
---|---|
A practical guide to learning awk | Book about AWK |
Gawk: Effective AWK Programming | Book about AWK |
Bash Beginners Guide | Book with a chapter about AWK |
Advanced Bash Scripting Guide | Book with a chapter about AWK |
To AWK or not | Course about AWK, by Pavlin Mitev |
AWK course | Course about AWK, by Richèl Bilderbeek |