Parsing text files line by line in Unix scripts

I'm finally back from my holidays and thrilled to be sharing the next of my Unix tips with you!

Today I'd like to talk about parsing text files in Unix shell scripts. This is one of the really popular areas of scripting, and there are a few typical limitations that everyone comes across.

Reading text files in Unix shell

If we agree that "reading a text file" means going through all the lines of a plain text file in order to process the data, then the cat command is the simplest demonstration of such a procedure:

redhat$ cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

As you can see, there's only one line in the /etc/redhat-release file, and we see what this line is.

But if for whatever reason you wanted to read this file from a script and handle the whole release information line as a single value, looping over the cat output would not work as expected:

bash-3.1$ for i in `cat /etc/redhat-release`; do echo $i; done;
Red
Hat
Enterprise
Linux
Client
release
5
(Tikanga)

Instead of reading a line of text from the file, our one-liner splits the line and prints every word on a separate line. This happens because of how the shell parses the command: the unquoted output of cat is split into words on the characters in the IFS variable (space, tab and newline by default), so the for loop receives each word, not each line, as a separate list element.
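
To make the splitting visible, here is a minimal sketch that counts how many items a for loop sees under two different IFS values. The count_items function and the temporary sample file are illustrative assumptions, not part of the original one-liner.

```shell
#!/bin/sh
# Sketch only: count_items and the sample file are illustrative names.

newline='
'
default_ifs="$(printf ' \t')$newline"   # space, tab, newline

# Count the items a for loop sees when the content of file $1 is
# expanded unquoted while IFS is set to $2. The parentheses run the
# body in a subshell, so the caller's IFS is left untouched.
count_items() (
    IFS=$2
    n=0
    for item in $(cat "$1"); do
        n=$((n + 1))
    done
    echo "$n"
)

sample=$(mktemp)
printf '%s\n' 'Red Hat Enterprise Linux Client release 5 (Tikanga)' > "$sample"

count_items "$sample" "$default_ifs"   # prints 8: one item per word
count_items "$sample" "$newline"       # prints 1: the whole line
rm -f "$sample"
```

With IFS reduced to a newline, the loop sees whole lines, but the unquoted expansion is still subject to filename globbing, so this is a demonstration of the mechanism and not a recommended way to read files.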

How to read text files line by line

Here's what I decided: if I can't make Unix shell ignore the spaces between words of each line of text, I'll disguise these spaces. Since my solution was getting pretty bulky for a one-liner, I've made it into a script. Here it is:

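bash-3.1$ cat /tmp/cat.sh
#!/bin/sh
FILE=$1
# Marker that temporarily replaces spaces; must not occur in the input file
UNIQUE='-={GR}=-'
#
# Exit with an error if no file name was given
if [ -z "$FILE" ]; then
        exit 1
fi
#
# Disguise spaces so that the for loop splits only on line breaks
for LINE in `sed "s/ /$UNIQUE/g" "$FILE"`; do
        # Turn the markers back into spaces
        LINE=`echo "$LINE" | sed "s/$UNIQUE/ /g"`
        echo "$LINE"
done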

As you can see, I've introduced a UNIQUE variable: a combination of characters that I use to replace the spaces in the original string. This combination must not occur anywhere in your text files, because later we turn the string back into its original form by replacing every instance of the $UNIQUE text with a plain space.
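
The round trip can be sketched on a single string. The disguise and restore helpers below are hypothetical names introduced only for this illustration; they wrap the same two sed substitutions the script uses.

```shell
#!/bin/sh
# Sketch only: disguise and restore are illustrative helper names.
UNIQUE='-={GR}=-'   # the same marker the script uses

# Replace every space in $1 with the marker.
disguise() { printf '%s\n' "$1" | sed "s/ /$UNIQUE/g"; }

# Replace every marker in $1 with a space again.
restore() { printf '%s\n' "$1" | sed "s/$UNIQUE/ /g"; }

disguised=$(disguise 'Red Hat Enterprise')
echo "$disguised"      # prints Red-={GR}=-Hat-={GR}=-Enterprise
restore "$disguised"   # prints Red Hat Enterprise
```

If a line already contained the marker text, restore would turn it into a space as well, which is why the marker has to be unique for your data.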

Since most of my needs involved relatively small text files, this rather expensive (in terms of CPU cycles) approach proved quite usable and fast enough.

Update: please see comments to this post for a much better approach to the same problem. Thanks again, Nails!
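
For reference, the approach suggested in the comments is a while read loop. The sketch below is one possible form of it, not the commenter's exact code: the print_lines function name is an assumption, and the IFS= and -r additions are there because a plain read strips leading whitespace and treats backslashes as escape characters.

```shell
#!/bin/sh
# Sketch only: print_lines is an illustrative name.

# Print every line of the file named in $1 unchanged.
# IFS= keeps leading and trailing whitespace; -r keeps backslashes literal.
# The "|| [ -n "$line" ]" part also handles a last line with no newline.
print_lines() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

Calling print_lines /etc/redhat-release would print the release line in one piece, with no marker substitution and no extra sed processes.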

Here's how my cat.sh script works on the /etc/redhat-release file from earlier:

bash-3.1$ /tmp/cat.sh /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)

Exactly what I wanted! Hopefully this little trick will save some of your time as well. Let me know if you like it or know an even better one yourself!

Related books

If you want to learn more, here's a great book:

Classic Shell Scripting

Comments

n00b says

September 8, 2008 at 11:19 am

Just what I was looking for! Thanks!

Gleb Reys says

September 8, 2008 at 11:29 am

I wish I'd thought of this a few years earlier – so many of my scripts could have been much better!

Nails Carmody says

September 12, 2008 at 11:38 pm

I don't mean to be rude or condescending, but you are trying to solve a problem that doesn't exist. While your UNIQUE variable idea is clever, why don't you just use a while loop:

while read LINE
do
echo "$LINE"
done < /etc/redhat-release

Also, your `cat $1` is often referred to as a UUOC (Useless Use of Cat). I found this link to be very instructional:

http://partmaps.org/era/unix/award.html

Sorry I had to disagree with you.

Regards,

Nails

Gleb Reys says

September 13, 2008 at 12:16 am

Nails, thanks for finding the courage to speak up! I'm glad you recognize the thinking pattern I've followed (trying to cat a file and expecting a line at a time)!

I'm glad you brought up the `cat $1` part too: not only should it be `cat $FILE` in my particular example, but I had never heard of UUOC, so I look forward to reading a whole page about it.

THANK YOU and please comment on anything else in the future – like I said somewhere, I'm not trying to look all-knowing, and I welcome any opportunity to learn and improve myself.

zenith191 says

August 4, 2009 at 5:17 pm

One problem with Nails' solution is that it removes leading whitespace. This causes loss of indentation.

ming says

August 24, 2009 at 5:46 am

Why not use read line? For example:

cat aFile | while read line; do echo $line; done

Heba says

May 7, 2010 at 1:09 pm

Many thanks ming, it worked for me.

P says

October 31, 2011 at 11:11 am

I am facing a problem with while read line on content that has a lot of '\' characters, because the while loop drops this character. For example, my file has:
ad\bc\gf\ufg, xyz
aw\ne\hw\iue\dkiue, sde
which the while read line loop reads as:
adbcgfufg, xyz
awnehwiuedkiue, sde

Does anyone have any idea how to overcome this?
