Should I care about unnecessary cats?
Solution 1
The "definitive" answer is of course brought to you by The Useless Use of cat
Award.
The purpose of cat is to concatenate (or "catenate") files. If it's only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
Instantiating cat just so your code reads differently makes for one more process and one more set of input/output streams that are not needed. Typically the real hold-up in your scripts will be inefficient loops and the actual processing. On most modern systems, one extra cat is not going to kill your performance, but there is almost always another way to write your code.
Most programs, as you note, are able to accept an argument for the input file. However, there is always the shell redirection operator <, which can be used wherever a STDIN stream is expected and which saves you one process by doing the work in the shell process that is already running.
You can even get creative with WHERE you write it. Normally it would be placed at the end of a command before you specify any output redirects or pipes like this:
sed s/blah/blaha/ < data | pipe
But it doesn't have to be that way. It can even come first. For instance your example code could be written like this:
< data \
sed s/bla/blaha/ |
grep blah |
grep -n babla
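A quick sketch of that placement flexibility (the file name data and the patterns are just examples): the redirection word can sit before the command name, between the command and its arguments, or at the end, and the result is identical.

```shell
# Throwaway example file for this sketch.
printf 'blah\nbabla blah\n' > data

<data grep blah      # redirection before the command
grep <data blah      # redirection between command and argument
grep blah <data      # redirection at the end (the conventional spot)
```

All three invocations print the same matching lines; the shell strips the redirection word out before grep ever sees its arguments.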
If script readability is your concern and your code is messy enough that adding a line for cat is expected to make it easier to follow, there are other ways to clean up your code. One that I use a lot, and that helps make scripts easy to figure out later, is breaking up pipes into logical sets and saving them in functions. The script code then becomes very natural, and any one part of the pipeline is easier to debug.
function fix_blahs () {
  sed s/bla/blaha/ |
  grep blah |
  grep -n babla
}
fix_blahs < data
You could then continue with fix_blahs < data | fix_frogs | reorder | format_for_sql. A pipeline that reads like that is really easy to follow, and the individual components can be debugged easily in their respective functions.
Solution 2
Here's a summary of some of the drawbacks of:
cat $file | cmd
over
< $file cmd
- First, a note: there are (intentionally, for the purpose of the discussion) missing double quotes around $file above. In the case of cat, that's always a problem except in zsh; in the case of the redirection, that's only a problem for bash or ksh88 and, for some other shells (including bash in POSIX mode), only when interactive (not in scripts).
- The most often cited drawback is the extra process being spawned. Note that if cmd is a builtin, that's even 2 processes in some shells like bash.
- Still on the performance front, except in shells where cat is builtin, that's also an extra command being executed (and of course loaded and initialised, along with the libraries it's linked to).
- Still on the performance front, for large files it means the system will have to alternately schedule the cat and cmd processes and constantly fill up and empty the pipe buffer. Even if cmd does 1GB-large read() system calls at a time, control will have to go back and forth between cat and cmd, because a pipe can't hold more than a few kilobytes of data at a time.
- Some cmds (like wc -c) can do some optimisations when their stdin is a regular file, which they can't do with cat | cmd, as their stdin is then just a pipe. With cat and a pipe, it also means they cannot seek() within the file. For commands like tac or tail, that makes a huge difference in performance, as with cat they need to store the whole input in memory.
- cat $file, and even its more correct version cat -- "$file", won't work properly for some specific file names like - (or --help, or anything starting with - if you forget the --). If one insists on using cat, one should probably use cat < "$file" | cmd instead for reliability.
- If $file cannot be opened for reading (access denied, doesn't exist...), < "$file" cmd will report a consistent error message (by the shell) and not run cmd, while cat $file | cmd will still run cmd, but with its stdin looking like an empty file. That also means that in things like < file cmd > file2, file2 is not clobbered if file can't be opened.

  In other words, you can choose the order in which the input and output files are opened, as opposed to cmd file > file2, where the output file is always opened (by the shell) before the input file (by cmd), which is hardly ever preferable.

  Note however that it won't help in cmd1 < file | cmd2 > file2, where cmd1, cmd2 and their redirections are performed concurrently and independently; you'd need to write that as { cmd1 | cmd2; } < file > file2 or (cmd1 | cmd2 > file2) < file, for instance, to avoid file2 being clobbered and cmd1 and cmd2 being run if file can't be opened.
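The error-handling and no-clobber points are easy to check for yourself. This is a minimal sketch assuming a POSIX shell; the file names missing (which does not exist) and file2 are throwaway names chosen for the example:

```shell
# Throwaway example files for this sketch.
rm -f missing
printf 'precious\n' > file2

# With cat: cat fails, but wc still runs and sees an empty stdin.
cat missing 2>/dev/null | wc -c

# With a redirection: the shell reports the error and wc never runs.
( wc -c < missing ) 2>/dev/null || echo "wc was never run"

# No clobbering: "< missing" fails before the shell gets around to
# opening file2 for writing, so file2 keeps its contents.
( < missing cat > file2 ) 2>/dev/null
cat file2    # still prints "precious"
```

The first command reports 0 bytes even though the file doesn't exist, which is exactly the silent-failure mode the bullet above warns about.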
Solution 3
Putting <file at the end of a pipeline is less readable than having cat file at the start. Natural English reads from left to right.

Putting <file at the start of the pipeline is also less readable than cat, I would say. A word is more readable than a symbol, especially a symbol which seems to point the wrong way.

Using cat preserves the command | command | command format.
Solution 4
One thing that the other answers here don't seem to have directly addressed is that using cat
like this isn't "useless" in the sense that "an extraneous cat process is spawned that does no work"; it's useless in the sense that "a cat process is spawned that does only unnecessary work".
In the case of these two:
sed 's/foo/bar/' somefile
<somefile sed 's/foo/bar/'
the shell starts a sed process that reads from somefile or stdin (respectively) and then does some processing - it reads up until it hits a newline, replaces the first 'foo' (if any) on that line with 'bar', then prints that line to stdout and loops.
In the case of:
cat somefile | sed 's/foo/bar/'
The shell spawns a cat process and a sed process, and wires cat's stdout to sed's stdin. The cat process reads a several-kilobyte (or maybe megabyte) chunk out of the file, then writes it to its stdout, where the sed command picks up from there as in the second example above. While sed is processing that chunk, cat is reading another chunk and writing it to its stdout for sed to work on next.
In other words, the extra work necessitated by adding the cat
command isn't just the extra work of spawning an extra cat
process, it's also the extra work of reading and writing the bytes of the file twice instead of once. Now, practically speaking and on modern systems, that doesn't make a huge difference - it may make your system do a few microseconds of unnecessary work. But if it's for a script that you plan on distributing, potentially to people using it on machines that are already underpowered, a few microseconds can add up over a lot of iterations.
Updated on September 18, 2022

Comments
-
tshepang almost 2 years
A lot of command-line utilities can take their input either from a pipe or as a filename argument. For long shell scripts, I find starting the chain off with a cat makes it more readable, especially if the first command would need multi-line arguments.

Compare

sed s/bla/blaha/ data \
  | grep blah \
  | grep -n babla

and

cat data \
  | sed s/bla/blaha/ \
  | grep blah \
  | grep -n babla

Is the latter method less efficient? If so, is the difference enough to care about if the script is run, say, once a second? The difference in readability is not huge.
-
Michael Mrozek almost 13 years
I spend way more time watching people attack each other about useless cat usage on this site than my system does actually starting the cat processes.
-
tcoolspy almost 13 years
@Michael: 100% agree. Heck, it took me more time to link to the old usenet award once than my computer will ever waste instantiating cat. However, I think the bigger question here is code readability, which is often a priority over performance. When faster can actually be written prettier, why not? Pointing out the issue with cat usually leads to the user having a better understanding of pipelines and processes in general. It's worth the effort so they write comprehensible code next time around.
-
Cascabel almost 13 years
I actually have another reason I don't like the first form: if you want to add another command at the beginning of the pipeline, you have to move the argument too, so the editing is more annoying. (Of course, this doesn't mean you have to use cat; Caleb's point about using functions and redirection solves that as well.)
-
G-Man Says 'Reinstate Monica' almost 9 years
Related: Remove useless-uses-of-cat or not? (Meta)
-
CoOl over 7 years
It's evening on the job, my head is refusing to work. I open stackoverflow and find a question titled "Should I care about unnecessary cats?" and see some homeless animals and a programmer pondering whether to feed them or not...
-
tcoolspy almost 13 years
@Tim: Bash and Zsh both support that, although I think it's ugly. When I'm worried about my code being pretty and maintainable I usually use functions to clean it up. See my last edit.
-
Gilles 'SO- stop being evil' almost 13 years
There are cases where cat with ≤1 argument is useful. $(cat /some/file) is an obvious one. Another is cat at the end of a pipeline, to force a command to output to a pipe rather than a regular file or terminal.
-
Gilles 'SO- stop being evil' almost 13 years
@Tim <file can come anywhere on the command line: <file grep needle or grep <file needle or grep needle <file. The exception is complex commands such as loops and groupings; there the redirection must come after the closing done/}/)/etc. @Caleb This holds in all Bourne/POSIX shells. And I disagree that it's ugly.
-
cjm almost 13 years
@Gilles, in bash you can replace $(cat /some/file) with $(< /some/file), which does the same thing but avoids spawning a process.
-
Richard Fortune over 11 years
Just to confirm that $(< /some/file) is of limited portability. It does work in bash, but not BusyBox ash, for example, or FreeBSD sh. It probably doesn't work in dash either, since those last three shells are all close cousins.
ghoti about 10 years
@Gilles, actually, arbitrary redirection locations within the command line work in csh and tcsh (at least on FreeBSD and OSX) as well, which are neither Bourne nor necessarily POSIX.
-
Sarah G about 9 years
"Costs you a process" pretty much sums up the answer here. If the cost of a process is going to substantially impact your available resources, you have a good reason to be concerned about an extra cat. If you're not on decades-old hardware or running your script in a tight loop, this is really just about aesthetics and has no relation to any actual engineering concerns.
-
Bor over 8 years
The link is not working.
-
Reveur over 8 years
Yes, I know about aliases. However, although this alias replaces the symbol with a word, it requires the reader to know about your personal alias setting, so it is not very portable.
-
Ole Tange almost 8 years
See oletange.blogspot.dk/2013/10/useless-use-of-cat.html for a test of the overhead of using the additional cat.
-
Ole Tange almost 8 years
Regarding the performance: this test shows the difference is on the order of 1 pct unless you are doing very little processing on the stream: oletange.blogspot.dk/2013/10/useless-use-of-cat.html
-
Stéphane Chazelas almost 8 years
@OleTange. Here's another test: truncate -s10G a; time wc -c < a; time cat a | wc -c; time cat a | cat | wc -c. There are a lot of parameters that get into the picture. The performance penalty can go from 0 to 100%. In any case, I don't think the penalty can be negative.
-
Ole Tange almost 8 years
wc -c is a pretty unique case, because it has a shortcut. If you instead do wc -w then it is comparable to grep in my example (i.e. very little processing, which is the situation where '<' can make a difference).
-
Peter Cordes about 7 years
@SarahG: Even on modern systems, cat hurts a lot with tail, wc -c, or anything else that benefits a lot from having a regular or seekable file as its stdin. (See Stéphane Chazelas's answer to this question.) The only time a separate cat process can help you is with a special file like /dev/urandom where it takes a lot of CPU time just to read it, and cat puts that work in a separate process.
-
Stéphane Chazelas almost 7 years
@OleTange, even with wc -w (on a 1GB sparse file in the C locale on Linux 4.9 amd64), I find the cat approach takes 23% more time on a multicore system and 5% when binding them to one core, showing the extra overhead incurred by having the data accessed by more than one core. You'll possibly get different results if you change the size of the pipe, use different data, involve real I/O, or use a cat implementation that uses splice()... All confirming that there are a lot of parameters getting into the picture and that in any case cat won't help.
-
G-Man Says 'Reinstate Monica' over 6 years
@OleTange: I just stumbled across this, and visited your blog. (1) While I see the content (mostly) in English, I see a bunch of words in (I guess) Danish: “Klassisk”, “Flipcard”, “Magasin”, “Mosaik”, “Sidebjælke”, “Øjebliksbillede”, “Tidsskyder”, “Blog-arkiv”, “Om mig”, “Skrevet”, and “Vis kommentarer” (but “Tweet”, “Like”, and the cookies banner are in English). Did you know about this, and is it under your control? (2) I have trouble reading your tables (2a) because the gridlines are incomplete, and (2b) because I don't understand what you mean by “Diff (pct)”.
-
Ole Tange over 6 years
blogspot.dk is run by Google. Try replacing it with blogspot.com. The "Diff (pct)" is the ms with cat divided by the ms without cat, in percent (e.g. 264 ms/216 ms = 1.22 = 122% = 22% slower with cat).
-
rogerdpack over 4 years
For me, with a 1GB file and wc -w, it's a difference of about 2%; it's a 15% difference if it's piped into a straight simple grep. Then, weirdly, if it's on an NFS file share it's actually 20% faster to read it if piped from cat (gist.github.com/rdp/7162414833becbee5919cda855f1cb86). Weird...
-
Ole Tange over 3 years
Re: < file cmd > file2. True, but cmd > file2 < file would clobber file2, and I think it is dangerous practice to build code where the order in which you put the redirections spells the difference between success and disaster.
-
Stéphane Chazelas over 3 years
@OleTange, how is that different from open STDOUT, ">", "file2" or die; open STDIN, "<", "file1" or die; in perl, for instance? The point is you can choose the order in which files are opened.
-
Ole Tange over 3 years
@StéphaneChazelas It is different in the way that I have heard multiple times: "You can put the redirection anywhere in the command line" (except it cannot be at the start in fish) - but it is never followed up with: "But the order matters in some shells (not csh, for example)". You suddenly get into a pretty long and messy explanation that is impossible for newbies to grasp. I have never heard anything similar about perl.