I'm finally back from my holidays and thrilled to be sharing the next of my Unix tips with you!
Today I'd like to talk about parsing text files in Unix shell scripts. This is one of the most popular areas of scripting, and there are a few typical limitations that everyone comes across.
Reading text files in Unix shell
If we agree that "reading a text file" means going through all the lines of a plain text file in order to process the data somehow, then the cat command is the simplest demonstration of such a procedure:
redhat$ cat /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)
As you can see, there's only one line in the /etc/redhat-release file, and we see what this line is.
But if, for whatever reason, you wanted to read this file from a script and assign the whole release information line to a shell variable, looping over the cat output would not work as expected:
bash-3.1$ for i in `cat /etc/redhat-release`; do echo $i; done;
Red
Hat
Enterprise
Linux
Client
release
5
(Tikanga)
Instead of reading a line of text from the file, our one-liner splits the line and outputs every word on a separate line. This happens because of how the shell parses the command: the output of the backquoted cat is split into words on whitespace (spaces, tabs, and newlines by default), so the for loop receives a list of separate words rather than whole lines and returns them one by one.
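To see the splitting in isolation, here is a minimal sketch that counts how many items the for loop actually receives. The temporary file is only a stand-in for /etc/redhat-release, and the names sample, word, and count are made up for this illustration.

```shell
#!/bin/sh
# Stand-in for /etc/redhat-release (hypothetical temporary file).
sample=$(mktemp)
echo 'Red Hat Enterprise Linux Client release 5 (Tikanga)' > "$sample"

# Unquoted command substitution: the shell splits the output into
# words on the characters in IFS (space, tab, newline by default).
count=0
for word in `cat "$sample"`; do
    count=$((count + 1))
done

# The file has one line, but the loop ran once per word.
echo "for loop saw $count items"

rm -f "$sample"
```

The single line of the file reaches the loop as eight separate words, which is exactly the behaviour shown above.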
How to read text files line by line
Here's what I decided: if I can't make the Unix shell ignore the spaces between the words of each line of text, I'll disguise these spaces. Since my solution was getting pretty bulky for a one-liner, I've made it into a script. Here it is:
bash-3.1$ cat /tmp/cat.sh
#!/bin/sh
FILE=$1
UNIQUE='-={GR}=-'
#
if [ -z "$FILE" ]; then
    exit;
fi;
#
for LINE in `sed "s/ /$UNIQUE/g" $FILE`; do
    LINE=`echo $LINE | sed "s/$UNIQUE/ /g"`;
    echo $LINE;
done;
As you can see, I've introduced a UNIQUE variable, which holds a combination of characters that I can use to replace spaces in the original string. This combination must not occur anywhere in your text files, because later we turn the string back into its original form by replacing every instance of the $UNIQUE text with a plain space.
Since most of my needs involved relatively small text files, this rather expensive (in terms of CPU cycles) approach proved to be quite usable and fast enough.
Update: please see comments to this post for a much better approach to the same problem. Thanks again, Nails!
Here's how my script would work on the already known /etc/redhat-release file:
bash-3.1$ /tmp/cat.sh /etc/redhat-release
Red Hat Enterprise Linux Client release 5 (Tikanga)
Exactly what I wanted! Hopefully this little trick will save some of your time as well. Let me know if you like it or know an even better one yourself!
n00b says
Just what I was looking for! Thanks!
Gleb Reys says
I wish I thought about this a few years earlier – so many scripts of mine could be much better!
Nails Carmody says
I don't mean to be rude or condescending, but you are trying to solve a problem that doesn't exist. While your UNIQUE variable idea is clever, why don't you just use a while loop:
while read LINE
do
echo "$LINE"
done < /etc/redhat-release
Also, your `cat $1` is often referred to as a UUOC. I found this link to be very instructional:
http://partmaps.org/era/unix/award.html
Sorry I had to disagree with you.
Regards,
Nails
Gleb Reys says
Nails, thanks for finding the courage to speak up! I'm glad you recognize the thinking pattern I've followed (trying to cat a file and expecting a line at a time)!
I'm glad you brought up the `cat $1` part too: not only should it be `cat $FILE` in my particular example, but I had never heard of the UUOC, so I look forward to reading a whole page about it.
THANK YOU and please comment on anything else in the future – like I said somewhere, I'm not trying to look all-knowing, and I welcome any opportunity to learn and improve myself.
zenith191 says
One problem with Nails' solution is that it removes leading whitespace. This causes loss of indentation.
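One common way to keep the indentation is to set IFS to an empty value for the read command only. The sketch below compares the two behaviours on a made-up two-line sample file. The names sample, stripped, and kept are hypothetical. The -r option is included so that backslashes in the input are left alone.

```shell
#!/bin/sh
# Hypothetical sample file whose second line is indented.
sample=$(mktemp)
printf 'top level\n    indented line\n' > "$sample"

# Default IFS: read strips leading and trailing whitespace.
stripped=$(while read -r line; do printf '[%s]\n' "$line"; done < "$sample")

# Empty IFS for this read only: the whitespace is preserved.
kept=$(while IFS= read -r line; do printf '[%s]\n' "$line"; done < "$sample")

printf 'default IFS:\n%s\n' "$stripped"
printf 'empty IFS:\n%s\n' "$kept"

rm -f "$sample"
```

Because the assignment is written as a prefix to read, IFS keeps its normal value for the rest of the script.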
ming says
why not use read line
for example
cat aFile| while read line; do echo $line; done
Heba says
Many thanks, ming, it worked for me.
P says
I am facing a problem with while read line on content that has a lot of '\' characters, as the while loop drops this character. For example, my file has:
ad\bc\gf\ufg, xyz
aw\ne\hw\iue\dkiue, sde
which is read by while read line loop as…
adbcgfufg, xyz
awnehwiuedkiue, sde
Does anyone have any idea how to overcome this?
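The usual answer is the -r option of read, which tells it to treat backslashes as ordinary characters instead of escape characters. Here is a minimal sketch using the two lines from the comment above. The temporary file and the names sample, plain, and raw are assumptions made for this illustration.

```shell
#!/bin/sh
# Recreate the sample lines; single quotes keep the backslashes literal.
sample=$(mktemp)
printf '%s\n' 'ad\bc\gf\ufg, xyz' 'aw\ne\hw\iue\dkiue, sde' > "$sample"

# Without -r, read treats each backslash as an escape and removes it.
plain=$(while read line; do printf '%s\n' "$line"; done < "$sample")

# With -r, backslashes are kept as they are. printf '%s\n' is used
# instead of echo because some shells' echo interprets sequences like \n.
raw=$(while IFS= read -r line; do printf '%s\n' "$line"; done < "$sample")

printf 'without -r:\n%s\n' "$plain"
printf 'with -r:\n%s\n' "$raw"

rm -f "$sample"
```

The first loop reproduces the stripped output described above, while the second prints the lines exactly as they appear in the file.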