How To Parse Text Files Line by Line in Unix scripts

I’m finally back from my holidays and thrilled to be sharing next of my Unix tips with you!

Today I’d like to talk about parsing text files in Unix shell scripts. This is one of the really popular areas of scripting, and there’s a few quite typical limitations which everyone comes across.

Reading text files in Unix shell

If we agree that by “reading a text file” we assume a procedure of going through all the lines found in a clear text file with a view to somehow process the data, then cat command would be the simplest demonstration of such procedure:

redhat$ cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

As you can see, there’s only one line in the /etc/redhat-release file, and we see what this line is.

But if you for whatever reason wanted to read this file from a script and assign the whole release information line to a Unix variable, using cat output would not work as expected:

bash-3.1$ for i in `cat /etc/redhat-release`; do echo $i; done;
RedHat
Enterprise
Linux
Client
release
5
(Tikanga)

Instead of reading a line of text from the file, our one-liner splits the line and outputs every word on a separate line of the output. This happens because of the shell syntax parsing – Unix shells assume space to be a delimiter of various elements in a list, so when you do a for loop, Unix shell interpreter treats each line with spaces as a list of elements, splits it and returns elements one by one.

How to read text files line by line

Here’s what I decided: if I can’t make Unix shell ignore the spaces between words of each line of text, I’ll disguise these spaces. Since my solution was getting pretty bulky for a one-liner, I’ve made it into a script. Here it is:

bash-3.1$ cat /tmp/cat.sh
#!/bin/sh
FILE=$1
UNIQUE='-={GR}=-'
#
if [ -z "$FILE" ]; then
        exit;
fi;
#
for LINE in `sed "s/ /$UNIQUE/g" $FILE`; do
        LINE=`echo $LINE | sed "s/$UNIQUE/ /g"`;
        echo $LINE;
done;

As you can see, I’ve introduced an idea of a UNIQUE variable, something containing a unique combination of characters which I can use to replace spaces in the original string. This variable needs to be a unique combination in a context of your text files, because later we turn the string back into its original version, replacing all the instances of $UNIQUE text with plain spaces.

Since most of the needs of mine required such functionality for a relatively small text files, this rather expensive (in terms of CPU cycles) approach proved to be quite usable and pretty fast.

Update: please see comments to this post for a much better approach to the same problem. Thanks again, Nails!

Here’s how my script would work on the already known /etc/redhat-release file:

bash-3.1$ /tmp/cat.sh /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

Exactly what I wanted! Hopefully this little trick will save some of your time as well. Let me know if you like it or know an even better one yourself!

Related books

If you want to learn more, here’s a great book:


classic-shell-scripting
Classic Shell Scripting

See also: