Less resource-hungry alternative to piping cat into gzip for huge files
Solution 1
cat doesn't use any significant CPU time (unless maybe on-disk decryption or decompression is involved and accounted to the cat process, which is the one reading from disk) or memory. It just reads the content of the files and writes it to the pipe in small chunks in a loop.
However, here, you don't need it. You can just do:
gzip -c file1 file2 file3 file4 > compress.gz
(not that it will make a significant difference).
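A quick sanity check (the file names and contents here are made up) shows that the multi-file invocation decompresses back to the plain concatenation of the inputs:

```shell
# Hypothetical sample inputs standing in for file1..file4.
printf 'alpha\n' > file1
printf 'beta\n'  > file2

# One gzip invocation, no cat needed.
gzip -c file1 file2 > compress.gz

# zcat restores the concatenation of both inputs.
zcat compress.gz > restored
cat file1 file2  > expected
cmp restored expected && echo same
```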
You can lower the priority of the gzip process (with respect to CPU scheduling) with the nice command. Some systems have an ionice command that does the same for I/O.
nice -n 19 ionice -c idle pigz -c file1 file2 file3 file4 > compress.gz
On Linux, this would run a parallel version of gzip with as little impact on the system as possible.
Having compress.gz on a different disk (if using rotational storage) would make it more efficient.
The system may cache the data that cat or gzip/pigz reads in memory, if it has memory available to do so. It does that in case you need the data again. In the process, it may evict other cached data that is more useful. Here, that data likely doesn't need to stay available.
With GNU dd, you can use iflag=nocache to advise the system not to cache the data:
for file in file1 file2 file3 file4; do
ionice -c idle dd bs=128k status=none iflag=nocache < "$file"
done | nice pigz > compress.gz
Solution 2
If you want to stretch the process out without using too many resources, try modifying the scheduling priority by changing the nice value:
nice -n 19 cat file1 file2 file3 file4 | gzip > compress.gz
man nice
-n, --adjustment=N add integer N to the niceness (default 10)
You can also regulate gzip's speed, which may be worth investigating: compression levels range from -1 (--fast, least CPU) to -9 (--best, best compression but most CPU).
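As a rough sketch of that trade-off (the sample input here is made up), compare the fastest and best levels on the same data:

```shell
# Hypothetical compressible input.
yes 'the same line over and over' | head -n 5000 > sample

gzip -1 -c sample > fast.gz   # --fast: least CPU, usually a larger file
gzip -9 -c sample > best.gz   # --best: most CPU, usually a smaller file

# Both decompress to the original content.
zcat fast.gz | cmp -s - sample && echo ok
ls -l fast.gz best.gz
```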
Other methods are available, but they will keep the files separate:
If you are happy to use the tar archive format, you can use the z option to gzip the contents on the fly; however, this may keep the processing high:
tar zcvf compress.tgz file[1234]
Or you can use zip, which can deal with multiple files:
zip compress.zip file[1234]
Comments
-
Skyler almost 2 years
I have some files, some of which are very large (several GB), which I need to concatenate into one big file and then zip, so something like this:
cat file1 file2 file3 file4 | gzip > compress.gz
which produces extremely high CPU and memory load on the machine, or even makes it crash, because cat generates several GB. I can't use tar archives; I really need one big chunk compressed by gzip.
How can I produce the same gz file in a sequential way, so that I don't have to cat several GB first, but still have all files in the same .gz in the end?-
wurtel over 9 years
cat doesn't first buffer everything before sending data to the pipe; it reads a buffer that will fit into the pipe (typically 32kB) and then writes it. Then it reads the next 32k, writes it, and so on. If end-of-file is reached on the first file, the next is opened and writing to the pipe continues as usual. So it's probably just gzip that is consuming the CPU; I don't believe memory will be exhausted by this. If your system crashes (as in a panic or oops), then you have other problems. -
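The point above, that cat streams in small chunks rather than buffering whole files, can be checked with a rough sketch (file name and sizes are made up): cap cat's address space well below the file size and watch it still succeed.

```shell
# 100 MB of zeros as hypothetical test data.
dd if=/dev/zero of=big bs=1M count=100 status=none

# Run cat under a ~50 MB address-space limit, half the file size.
# If cat buffered the whole file in memory, this would fail.
( ulimit -v 51200; cat big > copy )

cmp -s big copy && echo streamed
```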
Gilles 'SO- stop being evil' over 9 years
What makes you think cat is using significant resources? It doesn't. It just copies its input to its output. You're probably looking at the wrong thing. Tell us in detail what you observed.
-
-
Skyler over 9 years
I need the concatenated files in the zip as one resulting file, as if I had cat-ed them to a single file first and then compressed that single file. I can't use a tar because an automated process on another machine (not in my hands) needs a plain .gz containing the concatenated single big file.
-
geedoubleya over 9 years
@Foo I have modified my answer to include a compressed file containing a concatenated file.
-
Celada over 9 years
Note that gzip -c file1 file2 is not equivalent to cat file1 file2 | gzip. According to the manpage, in the former case "the output consists of a sequence of independently compressed members"; in the latter case, the concatenated input is compressed into a single compressed object. The manpage even goes on to say "To obtain better compression, concatenate all input files before compressing them.", so I would recommend the OP not switch to gzip -c etc... just for the sake of dropping cat, which is cheap anyway.
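This distinction is easy to see with throwaway inputs (the files below are made up): both commands produce valid gzip data that decompress to the same bytes, but one stream contains two members and the other only one.

```shell
# Hypothetical small inputs.
printf 'hello hello hello\n' > f1
printf 'hello hello hello\n' > f2

gzip -c f1 f2       > members.gz   # one compressed member per input file
cat f1 f2 | gzip -c > single.gz    # one member for the whole stream

# Both decompress to the same concatenated content...
zcat members.gz > a
zcat single.gz  > b
cmp a b && echo identical
# ...but compressing across file boundaries can do better.
ls -l members.gz single.gz
```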