Awk Tutorial Part 1

If you are a system administrator or developer, you need to process log files to get a better grasp of what is happening on your systems. Many people use Perl or Python to help with this task, but using one of the "P languages" is often overkill. Furthermore, every single day I am on a machine that I cannot make changes to, and thus cannot use my helper scripts there. Awk, however, has the tools to solve most on-the-fly log processing problems directly from the command line. In addition, awk can provide a more concise and faster solution than the pipeline of cut, grep, sort, and other commands you are currently using.
In this article, this is the format of the file I am working with:

    $ tail -n 1 access_log-2008-01
    1.1.1.1 - - [10/Jan/2008:17:26:51 -0600] "GET / HTTP/1.1" 200 38856

Basically, what we have here is: IP address, date, request, response code, response size (ignoring the dashes after the IP address).

How would you find the largest response sent by your HTTP server? My typical solution has always been:

    $ awk '{print $NF}' access_log-2008-01 | egrep -v '\-' | sort -n | tail -n 1
    10678272

However, there is clearly a better solution. By default, awk splits each input line on whitespace and assigns the entire line to $0, each field to $1, $2, and so on, and the number of fields to NF. See this example:

    $ echo a b c d e f | awk '{print $0}'
    a b c d e f
    $ echo a b c d e f | awk '{print $1}'
    a
    $ echo a b c d e f | awk '{print $2}'
    b
    $ echo a b c d e f | awk '{print NF}'
    6

Note that you can print the last field by using NF as the field number:

    $ echo a b c d e f | awk '{print $NF}'
    f

Or print the second field from the end:

    $ echo a b c d e f | awk '{print $(NF-1)}'
    e

Look at my example again:

    $ awk '{print $NF}' access_log-2008-01 | egrep -v '\-' | sort -n | tail -n 1
    10678272

That solution starts four processes (awk, egrep, sort, and tail) and passes the data through three more filters after awk. That is exceedingly inefficient! How about this:

    $ awk '{if ($NF > max) { max = $NF;}} END {print max}' access_log-2008-01
    10678272

This starts one process and reads the data only once. In English, that command says: for each line, if the last field is greater than max, store it in the variable "max". Once we have processed all the lines, print the variable max.

Which command do you suppose is faster?

    $ time awk '{print $NF}' access_log-2008-01 | egrep -v '\-' | sort -n | tail -n 1
    10678272

    real 0m1.107s
    user 0m1.070s
    sys 0m0.037s

    $ time awk '{if ($NF > max) { max = $NF;}} END {print max}' access_log-2008-01
    10678272

    real 0m0.207s
    user 0m0.194s
    sys 0m0.012s
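One detail worth a quick sketch: the original pipeline used egrep to drop lines whose size field is a dash (presumably the entries the server logs when no response body was sent), while the one-pass awk command compares $NF as-is and relies on awk's comparison rules to sort that out. A slightly more defensive variant might look like the following. The sample log lines and the names sample_log and max_size are made up for illustration; only the field layout follows the article.

```shell
# Hypothetical sample data in the same layout as the article's access log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
1.1.1.1 - - [10/Jan/2008:17:26:51 -0600] "GET / HTTP/1.1" 200 38856
2.2.2.2 - - [10/Jan/2008:17:26:52 -0600] "GET /logo.png HTTP/1.1" 304 -
3.3.3.3 - - [10/Jan/2008:17:26:53 -0600] "GET /big.iso HTTP/1.1" 200 10678272
4.4.4.4 - - [10/Jan/2008:17:26:54 -0600] "GET /about HTTP/1.1" 200 9120
EOF

# Only consider lines whose last field is all digits, and add 0 to force a
# numeric comparison, so "-" entries never take part in the maximum.
max_size=$(awk '$NF ~ /^[0-9]+$/ && $NF + 0 > max { max = $NF + 0 }
                END { print max + 0 }' "$sample_log")

echo "$max_size"
rm -f "$sample_log"
```

On this sample it prints 10678272. It is still a single process and a single pass over the file; the pattern in front of the action simply does the job that egrep did in the pipeline.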