How to Calculate the Sum of a Column Using the “Awk” Script in Bash

1. Overview

How to Calculate the Sum of a Column Using the “Awk” Script in Bash. We’ll learn how to add up a column of numbers in a Bash shell in this lesson. We’ll examine in more detail a few of the Bash utilities that can be applied in this way. Additionally, we’ll compare the effectiveness of the suggested alternatives.

2. Setup

Let’s first prepare our input file, which will be used for most of the Sum of a Column Using the “Awk” Script in Bash:

$ for i in `seq 1000000`; do echo $(($RANDOM%100)); done >numbers.csv

In this case, one million random numbers between 1 and 100 are contained in the file numbers.csv that we are creating. We’re utilizing the RANDOM built-in variable to generate 1,000,000 numbers using a for loop and the seq command.

To gain an understanding of how each command functions, we’ll also examine the local speed of the answers provided using the time command in the sections that follow.

3. Using the awk Tool

To begin, let’s use the awk command to get the total of the values in a column:

$ awk '{Total=Total+$1} END{print "Total is: " Total}' numbers.csv
Total is: 49471228

Let’s now use the time command to examine the execution time:

$ time awk '{Total=Total+$1} END{print "Total is: " Total}' numbers.csv
Total is: 49471228

real    0m0.228s
user    0m0.141s
sys     0m0.047s

It moves really quickly! In 0.228 seconds, we could compute the sum of a million numbers. Awk is actually one of the most potent file processing tools in Bash.

3.1. When the File Contains Multiple Columns

Thus far, we have learned how to use awk to sum up the integers in a column. Let’s examine the scenario in which we want to determine the Sum of a Column Using the “Awk” Script in Bash:

$ cat prices.csv
Books,40
Bag,70
Dress,80
Box,10

There are two columns in csv. Let’s now determine the total of the items in the second column:

$ awk -F "," '{Total=Total+$2} END{print "Total is: " Total}' prices.csv
Total is: 200

3.2. When the File Contains a Header Line

A header line can also occasionally be found in text or CSV files. For easier reading, the column names are typically held in this header line. Let’s change what we charge.csv, then include a header line:

$ cat prices.csv
Item,Value
Books,40
Bag,70
Dress,80
Box,10

When a header line appears in the file, we should remove it before any text processing happens. There are numerous approaches to achieving this. In this instance, the header line will be ignored using the awk tool. Now let’s change the command we used to get the column sum:

$  awk -F "," 'NR!=1{Total=Total+$2} END{print "Total is: " Total}' prices.csv
Total is: 200

The performance of the awk solution will be compared to several alternative methods of adding integers in a column, which will be examined in the following sections.

4. Iterating with the Bash Loops

Although awk is a fantastic tool, we could also loop through the column values.

4.1. Using the expr Command

To test the efficacy of the expr command in calculating the total inside the for loop, let’s conduct an experiment:

$ time (sum=0;for number in `cat numbers.csv`; do sum=`expr $sum + $number`; done; echo "Total is: $sum")
Total is: 49471228

real    212m48.418s
user    7m19.375s
sys     145m48.203s

Processing took an awfully long time. It took more than 3.5 hours to add a million numbers using the expr command. Interestingly, we should only utilize the expr utility—a throwback to Bash’s early days—when our scripts require to work with older (pre-POSIX) implementations.

4.2. Using Arithmetic Expansion

Since the use of the expr command didn’t help much, let’s try another approach using arithmetic expansion:

$ time (sum=0;for number in `cat numbers.csv`; do sum=$((sum+number)); done; echo "Total is: $sum")
Total is: 49471228

real    0m1.961s
user    0m1.813s
sys     0m0.125s

Here, we are utilizing the $((..)) syntax to calculate the sum using the arithmetic expansion. Using arithmetic expansion, we were able to add a million numbers in two seconds, in contrast to the expr command. Mathematical expansion enables us to carry out basic integer calculations. Nevertheless, it is not compatible with floating-point numbers. For floating-point operations, therefore, we have to utilize the bc command. In the following part, we’ll examine how the bc command is implemented.

5. Adding Values With the bc Command

A calculation is done on a single-line expression using the BC command. As a result, we must put the integers together on a single line and divide them using the addition operator. The expression will then be passed to BC in order to determine the total. Let’s examine many approaches of achieving this.

5.1. Using the paste Command

Sum of a Column Using the “Awk” Script in Bash, Let’s start by examining the paste command, which puts the first ten values in our dataset on a single line separated by a plus sign (+):
$ cat numbers.csv| head -10 | paste -sd+ - 
2+44+6+15+23+0+15+88+82+1

The option -s ensures that paste joins all the numbers in a single line. We also specified the d+ option to add the ‘+’ character as a delimiter while joining the entries.

With that, we’re ready to provide this sequence as stdin to the bc command:

$ time echo "Total is: $(cat numbers.csv | paste -sd+ - | bc)"
Total is: 49471228

real    0m0.244s
user    0m0.203s
sys     0m0.063s

Notably, the performance is better than what we observed with Bash loops (~ 2 seconds). Also, it came close but couldn’t beat the performance of the awk command (0.228 seconds).

5.2. Using the tr Command

Similar to the paste command, let’s generate a sequence again using the tr command:

$ cat numbers.csv | head -10 |tr "\n" "+"
2+44+6+15+23+0+15+88+82+1+

Here, we translated each newline (‘\n’) to the plus (‘+’) character. But, notice the extra ‘+’ at the end of the sequence. As a workaround, we can add an extra zero at the end to take care of this before we pass it to the bc command:

$ cat numbers.csv | head -10 |tr "\n" "+" ; echo "0"
2+44+6+15+23+0+15+88+82+1+0

Now, let’s redirect the output to the bc command:

$ time ((cat numbers.csv | tr "\n" "+" ; echo "0") | bc)
49471228

real    0m0.217s
user    0m0.203s
sys     0m0.031s

A combination of tr and bc command executes faster than the awk solution.

5.3. Using the sed Command

Finally, we’ll use the sed command to generate the sequence:

 $ cat numbers.csv | head -10 | sed -z 's#\n#+#g'
2+44+6+15+23+0+15+88+82+1+

Once more, we’ve used the sed command’s search and replace feature to swap out newlines (‘\n’) with plus (‘+’) characters. Furthermore, in keeping with the preceding section, we’re writing zero at the conclusion to handle the extra plus operator:

$ time ((cat numbers.csv | sed -z 's#\n#+#g' ; echo "0") | bc)
49471228

real    0m0.343s
user    0m0.281s
sys     0m0.109s

Here, the newline for the sed command has a different meaning when the -z option is used. The null character will be read as the end of the line instead of \n, which it will no longer recognize as the end of the line. In essence, we might use the plus (‘+’) character in place of the newline (‘\n’).

Note that replacing the characters with sed is slow as compared to the tr and the paste options

In such cases, awk should be the go-to tool because all the speed advantages of non-awk alternatives will be eaten up by the time taken to pre-process the file for extracting a single column before calculating the sum of its elements.

6. Conclusion:

In this tutorial, we discussed ways to add up a column of numbers in a Bash shell.

We started by discussing the solution using the awk tool. Additionally, we also explored ways to handle files containing multiple columns or a header line.

Then we implemented the solution using Bash loops. Finally, we presented techniques to calculate the sum with the bc command by converting the number into a single-line expression with the trpaste, and sed utilities.

Leave a Reply

Your email address will not be published. Required fields are marked *