Less resource-hungry alternative to piping `cat` into gzip for huge files

Solution 1

cat doesn't use any significant CPU time (unless perhaps on-disk decryption or decompression is involved and accounted to the cat process, which is the one reading from disk) or memory. It just reads the content of the files and writes it to the pipe in small chunks, in a loop.
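
If you want to verify that on your own system, GNU time (commonly installed as /usr/bin/time) reports both CPU time and peak memory use; a quick check with one of the large files:

/usr/bin/time -v cat file1 > /dev/null

The user and system CPU times stay far below the elapsed time, and the maximum resident set size stays small, since cat only ever holds one buffer at a time.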

However, here, you don't need it. You can just do:

gzip -c file1 file2 file3 file4 > compress.gz

(not that it will make a significant difference).

You can lower the priority of the gzip process (with respect to CPU scheduling) with the nice command. Some systems also have an ionice command to do the same for I/O.

nice -n 19 ionice -c idle pigz -c file1 file2 file3 file4 > compress.gz

On Linux, this would run a parallel version of gzip (pigz) with as little impact on the system as possible.

Writing compress.gz to a different disk (if using rotational storage) would make the process more efficient.
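
For example, assuming a second disk mounted at /mnt/other (a placeholder path), you would simply point the output there so that reads and writes don't compete for the same disk heads:

# /mnt/other is a placeholder for a mount point on a different disk
nice -n 19 ionice -c idle pigz -c file1 file2 file3 file4 > /mnt/other/compress.gz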

The system may cache the data that cat or gzip/pigz reads in memory if it has memory available to do so, in case you need that data again. In the process, it may evict other cached data that is more useful. Here, that data likely doesn't need to stay cached.

With GNU dd, you can use iflag=nocache to advise the system not to cache the data:

for file in file1 file2 file3 file4; do
  # read in 128k blocks, advising the kernel not to keep the data cached
  ionice -c idle dd bs=128k status=none iflag=nocache < "$file"
done | nice pigz > compress.gz
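
Since the output decompresses to exactly the concatenation of the inputs, you can sanity-check the result afterwards (a quick test using bash process substitution):

gzip -dc compress.gz | cmp - <(cat file1 file2 file3 file4)

cmp prints nothing and exits 0 if the two streams are identical.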

Solution 2

If you want to stretch the process out without using too many resources, try modifying the scheduling priority by changing the nice value:

nice -n 19 cat file1 file2 file3 file4 | gzip > compress.gz  

man nice

  -n, --adjustment=N
         add integer N to the niceness (default 10)

You can also regulate the gzip compression level, which may be worth investigating (-1/--fast up to -9/--best).
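
For instance (both are standard gzip level flags):

nice -n 19 cat file1 file2 file3 file4 | gzip --best > compress.gz
nice -n 19 cat file1 file2 file3 file4 | gzip --fast > compress.gz

--best (-9) spends more CPU time for a smaller file; --fast (-1) does the opposite.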

Other methods are available, but they will keep the files separate:

If you are happy to use the tar archive format, you can use the z argument to gzip the contents on the fly; however, this may keep the processing load high:

tar zcvf compress.tgz file[1234]

Or you can use zip, which can deal with multiple files:

zip compress.zip file[1234]

Comments

  • Skyler almost 2 years

    I have some files, some of which are very large (like several GB), which I need to concatenate into one big file and then zip, something like this:

    cat file1 file2 file3 file4 | gzip > compress.gz
    

    which produces extremely high CPU and memory load on the machine or even makes it crash, because cat generates several GB.

    I can't use tar archives, I really need one big chunk compressed by gzip.

    How can I produce the same gz file in a sequential way, so that I don't have to cat several GB first, but still have all files in the same .gz in the end?

    • wurtel over 9 years
      cat doesn't first buffer everything before sending data to the pipe; it reads a buffer that will fit into the pipe (typically 32kB) and then writes that. Then it reads the next 32k, writes it, etc. If end-of-file is found on the first file, then the next is opened and writing to the pipe continues as usual. So it's probably just gzip that is consuming the CPU; I don't believe that memory will be exhausted by this. If your system crashes (as in a panic or oops) then you have other problems.
    • Gilles 'SO- stop being evil' over 9 years
      What makes you think cat is using significant resources? It doesn't. It just copies its input to its output. You're probably looking at the wrong thing. Tell us in detail what you observed.
  • Skyler over 9 years
    I need the concatenated files in the zip as one resulting file, like I would cat them to a single file first and then compress that single file. I can't use a tar because an automated process on another machine (not under my control) needs a plain .gz containing the concatenated single big file.
  • geedoubleya over 9 years
    @Foo I have modified my answer to include a compressed file containing a concatenated file.
    • Celada over 9 years
    Note that gzip -c file1 file2 is not equivalent to cat file1 file2 | gzip. According to the manpage, in the former case "the output consists of a sequence of independently compressed members". In the latter case, the concatenated input is compressed to a single compressed object. The manpage even goes on to say "To obtain better compression, concatenate all input files before compressing them." so I would recommend that the OP not switch to gzip -c etc... just for the sake of dropping cat which is cheap anyway.
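
      A quick way to see the difference is to compare the compressed sizes produced by the two pipelines:

          cat file1 file2 | gzip | wc -c
          gzip -c file1 file2 | wc -c

      The single-stream version (first line) is usually at least as small.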