COUNT and “GROUP BY” with awk

I want to share with you a small shell script that I often use for reporting purposes: it lets you simulate the famous “count and group by” well known to DBAs, with no database required. It only uses “awk”, which is installed with every Linux / Unix distribution (I think?), so you can use it “out of the box”.

In other words: the script counts the number of times a word appears in a text file and displays a formatted result.

There are plenty of cases in which this code snippet can be useful. Personally, I use it to find out on which instances my RMAN channels are allocated when I back up RAC databases (based on the RMAN log files).

For my examples, I will use the following “dummy” file:

# cat test.txt
titi toto
toto
titi
toto,
Toto_
	-tIti
 #test
^tutu

Simple version

Let’s try a first version (if you want the final result, you can jump directly to the end of this post).

awk '{
for (field = 1; field <= NF; field++)
frequency[$field]++
}
END {
for (word in frequency)
printf "%s\t%d\n", word, frequency[word]
}' test.txt

Output:

titi    2
#test   1
^tutu   1
Toto_   1
toto,   1
-tIti   1
toto    2

The script is pretty simple, but let’s have a look at how it works.

for (field = 1; field <= NF; field++)
frequency[$field]++

This “for loop” iterates through all fields in the current record (here, 1 record = 1 line):

  • field = 1: we start at the first position in the record.
  • field <= NF; field++: we keep going until we reach the end of the record (NF is an awk built-in variable holding the number of fields in the current record).
  • frequency[$field]++: the most important part! “$field” is interpreted by awk as “the field at position field” in the record, i.e. $1, $2, and so on (see the short illustration below).
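
If this indirection is not obvious, here is a quick one-liner (my own illustration, not part of the original script) that prints each field of each line together with its position:

awk '{ for (i = 1; i <= NF; i++) printf "field %d = %s\n", i, $i }' test.txt

On the first line of the dummy file, it prints “field 1 = titi” and “field 2 = toto”.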

So, in my example:

  • First loop on first record: $field=titi
  • Second loop on first record: $field=toto
  • First loop on second record: $field=toto

The “frequency” array then evolves like this:

  • First loop on first record: frequency[titi]=1
  • Second loop on first record: frequency[toto]=1
  • First loop on second record: frequency[toto]=2

Finally, the script displays each word and its frequency in the file:

for (word in frequency)
printf "%s\t%d\n", word, frequency[word]

Nothing special here. We are just formatting the output with a tab (\t) and a newline (\n).
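
Note that with “for (word in frequency)”, awk does not guarantee any particular output order. If you want something closer to a real “ORDER BY count DESC”, you can simply pipe the result through sort (my own addition, not part of the original script):

awk '{ for (f = 1; f <= NF; f++) count[$f]++ } END { for (w in count) printf "%s\t%d\n", w, count[w] }' test.txt | sort -k2 -nr

The “-k2 -nr” options tell sort to use the second column (the count), numerically, in reverse order.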

For a first version, it looks pretty good! But a couple of enhancements would be welcome:

    • Make the script case insensitive (toto=Toto=TOtO…)
    • Ignore special characters (^, #, -, …)

More advanced version

To reach our goal, we simply have to add two additional lines:

awk '{
$0 = tolower($0)
gsub(pattern = "[^[:alnum:][:blank:]]", "", $0)
for (field = 1; field <= NF; field++)
frequency[$field]++
}
END {
for (word in frequency)
printf "%s\t%d\n", word, frequency[word]
}' test.txt

Output:

titi    3
test    1
tutu    1
toto    4

Here we go 🙂

Explanations:

$0 = tolower($0)

Record is converted to lowercase.

gsub(pattern = "[^[:alnum:][:blank:]]", "", $0)

This regex is not trivial, but you can read it as: replace everything that is not (^) an alphanumeric character ([:alnum:]) or a blank character ([:blank:]) with an empty string (“”) in the current record ($0).
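
To see both transformations in isolation, here is a quick check (my own illustration, independent of the counting script) on the trickiest line of the dummy file:

printf '\t-tIti\n' | awk '{ $0 = tolower($0); gsub(/[^[:alnum:][:blank:]]/, "", $0); print }'

The tab is kept (it is a blank character), the “-” is removed, and the uppercase “I” is lowercased, so the output is simply “titi” preceded by the tab.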

If you need to keep some special characters, you can add them to the regex. For example:

gsub(pattern = "[^[:alnum:][:blank:]_#^]", "", $0)

This way, “_”, “#” and “^” will not be removed from the record.

Output:

titi    3
#test   1
^tutu   1
toto_   1
toto    3

If you need to validate your regex, you can try the excellent regex101.com!

As mentioned in the introduction, I use this script to find out on which instances my channels are allocated when I back up a RAC database. The full script looks like this:

grep "instance=" rman_log.log|awk -F "=" '{print $3}'|awk '{print $1}'|awk '{
$0 = tolower($0)
gsub(pattern = "[^[:alnum:][:blank:]]", "", $0)
for (field = 1; field <= NF; field++)
frequency[$field]++
}
END {
for (word in frequency)
printf "%s\t%d\n", word, frequency[word]
}'

Output:

prod1  2
prod2  2
prod3  2
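
For context, a channel allocation line in an RMAN log usually looks something like this (the exact wording may vary with the RMAN version, so take this sample line as an assumption):

channel ch1: SID=35 instance=prod1 device type=DISK

With “=” as the field separator, the third field of that line is “prod1 device type”; the next awk keeps only its first word, “prod1”, which is exactly what the counting script then groups on.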

Pretty simple!

Have a nice day, and stay tuned for more DBA stuff!
