In Linux text processing, one of the most basic and frequent operations is extracting a substring from a string. In this tutorial, we’ll look at a few different ways to extract substrings on the Linux command line.

2. Introduction to the Problem

A substring is a portion of a string, as its name implies. The task at hand is rather simple: we need to extract a specific segment from a given string. Nonetheless, there are two categories of requirements for extraction: pattern-based and index-based.

Let’s illustrate the two different requirements with a couple of examples.

The start and end indexes within the original string define an index-based substring. Let’s examine an example of extracting a substring based on an index.

Given the input string “0123Linux9”, we wish to extract the substring from index 4 through index 8. The expected result is “Linux”.

Let’s look at an illustration of a pattern-based substring next.

Let’s take an example where we have the input string “Eric,Male,28,USA”. It is a string with the values Name, Gender, Age, and Country separated by commas.

Let’s now assume that we wish to extract Eric’s age, that is, the third field, “28”. Since the Name and Gender values have variable lengths, we cannot predict the start index of the target substring. As a result, the implementation will differ from index-based extraction.

In this article, we’ll address some common ways to extract substrings in the Linux command line. Of course, we’ll cover both extraction types.

3. Extracting an Index-Based Substring:

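First, let’s have a look at how to extract index-based substrings. We’ll introduce four ways to do this:

  • Using the cut command
  • Using the awk command
  • Using Bash’s substring expansion
  • Using the expr command

Next, we’ll see them in action.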

3.1. Using the cut Command

We can extract from the Nth to the Mth character from the input string using the cut command: cut -c N-M. 

We need to take the substring from index 4 through index 8, as we covered in a previous section.

Since we are discussing indexes in the context of Bash, these are 0-based indexes. The cut command, however, counts characters starting from 1.

Consequently, if we wish to use the cut command to solve the problem, we must add one to both the start and the end index. The range therefore becomes 5-9.

Let’s now investigate whether the cut command can resolve the issue:

$ cut -c 5-9 <<< '0123Linux9'
Linux

As the output shows, we got the expected substring, “Linux”, so the problem is solved.

In the example above, we passed the input string to the cut command via a here-string, which saves us from piping the output of echo into cut.
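The index shift described above can be wrapped in a small helper. The following is only a minimal sketch: substr_cut is a hypothetical function name that doesn't appear elsewhere in this tutorial, and it assumes that the start and end indexes are 0-based and inclusive.

```shell
# substr_cut STRING START END
# Hypothetical helper: START and END are 0-based, inclusive indexes.
substr_cut() {
  local str=$1 start=$2 end=$3
  # cut -c counts characters from 1, so shift both indexes by one
  cut -c "$((start + 1))-$((end + 1))" <<< "$str"
}

result=$(substr_cut '0123Linux9' 4 8)
echo "$result"
```

With this helper, the caller can keep thinking in 0-based indexes while cut receives the 1-based range 5-9.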

3.2. Using the awk Command

The Swiss army knife of Linux text processing solutions, awk, is a handy tool to have on hand.

We can call the built-in substr() function directly in an awk script to obtain the substring.

The function substr(s, i, n) takes three parameters. Let’s examine them in more detail:

  • s – The input string
  • i – The start index of the substring (awk uses the 1-based index system)
  • n – The length of the substring. If it’s omitted, awk will return from index i until the last character in the input string as the substring.

Now let’s see if awk’s substr() function can give us the expected result:

$ awk '{print substr($0, 5, 5)}' <<< '0123Linux9'
Linux

Good! The awk command works as expected.

Here we pass i=5 because awk needs the 1-based index, so the 0-based start index 4 becomes 5. The third argument, also 5, is the length of the target substring, which we get from 8-4+1.
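If we'd rather let awk do this arithmetic, we can hand it the 0-based indexes as variables. This is a sketch under our own assumptions: the variable names start, end, s, and e are illustrative, and the end index is treated as inclusive. It also shows what happens when the length argument is omitted.

```shell
STR='0123Linux9'
start=4   # 0-based start index
end=8     # 0-based end index (inclusive)

# awk's substr() counts from 1: shift the start by one; the length is end-start+1
by_range=$(awk -v s="$start" -v e="$end" '{ print substr($0, s + 1, e - s + 1) }' <<< "$STR")

# With the length omitted, substr() returns everything up to the end of the string
to_end=$(awk -v s="$start" '{ print substr($0, s + 1) }' <<< "$STR")

echo "$by_range"
echo "$to_end"
```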

3.3. Using Bash’s Substring Expansion

We’ve seen how simple it is to extract index-based substrings using cut and awk.

As an alternative, Bash can handle the issue because it allows substring expansion using ${VAR:start_index:length}.

Bash is the default shell in many contemporary Linux distributions. This means we can often solve the problem without calling an external command:

$ STR="0123Linux9"
$ echo "${STR:4:5}"
Linux

As we can see in the output above, we solved the problem using pure Bash.
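Substring expansion has a couple of variations that are handy to know. The sketch below reuses the same STR value; the variable names from_four and last_six are only illustrative.

```shell
STR="0123Linux9"

from_four=${STR:4}       # length omitted: everything from index 4 to the end
last_six=${STR: -6:5}    # negative offset counts from the end; the space before "-" is required

echo "$from_four"
echo "$last_six"
```

The space in ${STR: -6:5} matters: without it, Bash would read ":-" as the "use a default value" expansion instead of a negative offset.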

3.4. Using the expr Command

Although Bash is included in most Linux distributions, some Linux systems, especially embedded ones, still come without it.

The expr command is part of the Coreutils package, so it’s available on nearly every Linux system. Note, however, that its substr subcommand is a GNU extension rather than part of POSIX expr.

The substr subcommand of expr allows us to quickly extract substrings based on index:

expr substr <input_string> <start_index> <length>

It’s worth mentioning that the expr command uses the 1-based index system.

Let’s use expr with the substr command to solve our problem:

$ expr substr "0123Linux9" 5 5
Linux

The output above shows that the expr command has solved the problem.
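In a script, we usually want to keep the result rather than just print it. Here is a minimal sketch that assumes an expr implementation with the substr subcommand (such as the one in GNU Coreutils); the variable names start, end, and result are our own.

```shell
STR="0123Linux9"
start=4   # 0-based start index
end=8     # 0-based end index (inclusive)

# expr counts from 1, so shift the start by one; the length is end-start+1
result=$(expr substr "$STR" "$((start + 1))" "$((end - start + 1))")

echo "$result"
```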

4. Extracting a Pattern-Based Substring

We now know multiple techniques for extracting substrings based on indexes. Next, we’ll examine pattern-based substrings.

Even though the solutions don’t look exactly like the index-based ones, learning them isn’t too difficult.

We will discuss two methods for resolving our issue:

  • Using the cut command
  • Using the awk command

Further, we’ll have a look at a different pattern-based substring extraction problem.

4.1. Using the cut Command

When handling field-based data, the cut command is a useful tool.

Let’s quickly review our problem. The values in our input string, “Eric,Male,28,USA”, are separated by commas. Our objective is to extract the third field, “28”.

To solve the problem, we can tell cut that the fields are separated by commas (-d ,) and ask it for the third field (-f 3):

$ cut -d , -f 3 <<< "Eric,Male,28,USA"
28

We got the expected result and solved the problem.
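The same options work when we want to store the value or pick more than one field. The following sketch uses hypothetical variable names (LINE, age, name_and_age); when several fields are selected, cut joins them with the input delimiter.

```shell
LINE="Eric,Male,28,USA"

age=$(cut -d , -f 3 <<< "$LINE")             # third field only
name_and_age=$(cut -d , -f 1,3 <<< "$LINE")  # first and third fields

echo "$age"
echo "$name_and_age"
```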

4.2. Using the awk Command

awk is also good at handling field-based data. A compact awk one-liner can solve the problem:

$ awk -F',' '{print $3}' <<< "Eric,Male,28,USA"
28

Moreover, we can use awk to create more universal solutions because its field separator (FS) supports regex.

For example, if we insert a space after each comma, the input string becomes “Eric, Male, 28, USA”. This is a format we often encounter in practice.

Then, using the cut command to fix the issue wouldn’t be a wise move. This is due to the fact that the cut command only allows one character to be used as a field delimiter.
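To see the limitation in practice, here is a short sketch of what cut returns when we split the spaced input on the comma alone. The variable name with_space is only illustrative, and the brackets in the output are there just to make the leading space visible.

```shell
# Splitting on "," alone leaves the space that follows each comma in the field
with_space=$(cut -d , -f 3 <<< "Eric, Male, 28, USA")

printf '[%s]\n' "$with_space"
```

The extracted field is " 28" with a leading space, so we'd need an extra trimming step.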

However, it’s still a piece of cake for awk:

$ awk -F', ' '{print $3}' <<< "Eric, Male, 28, USA"
28

We can even write one awk command to work for both cases. This could be a useful trick in the real world:

$ awk -F', ?' '{print $3}' <<< "Eric, Male, 28, USA"
28
$ awk -F', ?' '{print $3}' <<< "Eric,Male,28,USA"
28
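The same idea can be pushed a little further. As a sketch, assuming the input may also carry spaces before the commas (the messy LINE value below is a made-up example), we can allow optional spaces on both sides of the separator:

```shell
# Hypothetical messy input: spaces may appear before or after each comma
LINE="Eric , Male ,28, USA"

# " *, *" matches a comma together with any spaces around it
age=$(awk -F' *, *' '{print $3}' <<< "$LINE")

echo "$age"
```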

4.3. A Different Pattern-Based Substring Case

Our issue with Eric’s age has been resolved thus far. Our input for this problem is a value that is field-based.

In actuality, though, the pattern-based substring might not always be found in a CSV entry. Check out this other example.

Given the input string “whatever dataBEGIN:Interesting dataEND:something else”, our objective is to extract the substring that falls between “BEGIN:” and “END:”. In other words, we need the text between two patterns.

Clearly, the cut command is of no use in this situation. It’s still not a challenge for awk, though. It offers multiple solutions to this problem.

Now let’s see how awk handles it. To make the commands easier to read, we store the input string in a variable called STR:

$ STR="whatever dataBEGIN:Interesting dataEND:something else"
$ awk -F'BEGIN:|END:' '{print $2}' <<< "$STR"
Interesting data

$ awk '{ sub(/.*BEGIN:/, ""); sub(/END:.*/, ""); print }' <<< "$STR"
Interesting data

The first awk command defines “BEGIN:” or “END:” as the field separator and takes the second field.

However, the second awk solution doesn’t tweak the field separator. Instead, it applies two regex substitutions to achieve the goal:

  • sub(/.*BEGIN:/, "") – Removes everything from the beginning of the string through “BEGIN:”
  • sub(/END:.*/, "") – Removes everything from “END:” to the end of the input string

After the execution of these two substitutions, we’ll have our expected result. All we need to do is print it out.
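For completeness, the same two-step trimming can also be done without awk, using the shell's pattern-removal parameter expansions. This is an alternative technique rather than one of the awk solutions above, and the variable names tmp and result are our own.

```shell
STR="whatever dataBEGIN:Interesting dataEND:something else"

tmp=${STR#*BEGIN:}       # drop the shortest prefix ending in "BEGIN:"
result=${tmp%%END:*}     # drop the longest suffix starting with "END:"

echo "$result"
```

Like the second awk solution, this removes the text before and after the target instead of selecting the target directly.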

5. Conclusion

Linux text processing basics include the ability to extract a substring. The substring extraction can be pattern-based or index-based, depending on the needs.

In this article, we covered both types of substring extraction with examples.

We also looked into the capabilities of the useful text processing program awk.
