unix - split a huge .gz file by line


Solution 1

How to do this best depends on what you want:

  • Do you want to extract a single part of the large file?
  • Or do you want to create all the parts in one go?

If you want a single part of the file, your idea to use gunzip and head is right. You can use:

gunzip -c hugefile.txt.gz | head -n 4000000

That would output the first 4000000 lines on standard output; you probably want to append another pipe to actually do something with the data.
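For instance, you could re-compress the chunk in the same pipeline, so the uncompressed text never touches the disk (file names here are illustrative, and a tiny sample file stands in for the real one):

```shell
# Tiny stand-in for hugefile.txt.gz, so the pipeline can be tried anywhere:
seq 1 10 | gzip > hugefile.txt.gz

# Extract the first chunk (here: 4 lines) and re-compress it on the fly;
# no uncompressed data is ever written to disk:
gunzip -c hugefile.txt.gz | head -n 4 | gzip > part1.txt.gz

gunzip -c part1.txt.gz    # prints 1 2 3 4, one per line
```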

To get the other parts, you'd use a combination of head and tail, like:

gunzip -c hugefile.txt.gz | head -n 8000000 | tail -n 4000000

to get the second block.

Is perhaps doing a series of these a solution, or would gunzip -c require enough space for the entire file to be unzipped?

No, gunzip -c does not require any disk space: it decompresses in memory and streams the result to stdout.


If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
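Pulling the head/tail approach above together, the splitter script from the question could be sketched roughly like this (the get_chunk name and argument order are my own invention):

```shell
# Hypothetical helper: print chunk number $2 (1-based), $3 lines each,
# of the gzipped file $1, decompressing on the fly.
get_chunk() {
  file=$1; part=$2; size=$3
  gunzip -c "$file" | head -n $((part * size)) | tail -n "$size"
}

# Demo on a tiny stand-in file:
seq 1 12 | gzip > sample.txt.gz
get_chunk sample.txt.gz 2 4    # prints lines 5 to 8
```

Note that each call re-reads the file from the start, which is why generating all the parts in one go with split is faster.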

Solution 2

Pipe to split, using either gunzip -c or zcat to open the file:

gunzip -c bigfile.gz | split -l 400000

Add output specifications to the split command.
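For example, with GNU split you can ask for numeric suffixes and a custom prefix (the chunk_ prefix here is arbitrary, and a tiny sample file stands in for the real one):

```shell
# GNU split: -d gives numeric suffixes, -a 2 makes them two digits,
# "-" reads from stdin, and chunk_ is an arbitrary output prefix.
seq 1 10 | gzip > bigfile.gz        # tiny stand-in for the real file
gunzip -c bigfile.gz | split -l 4 -d -a 2 - chunk_
ls chunk_*    # chunk_00 chunk_01 chunk_02
```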

Solution 3

As you are working on a (non-rewindable) stream, you will want to use the tail -n +N form to get lines starting from line N onwards.

zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
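The pattern generalizes: chunk k (1-based) of N lines starts at line (k-1)*N + 1. A sketch, using gunzip -c (equivalent to zcat for .gz input) and a tiny stand-in file:

```shell
k=3; N=4                           # which chunk, and its size
seq 1 12 | gzip > sample2.txt.gz   # tiny stand-in for hugefile.txt.gz
# Chunk k starts at line (k-1)*N + 1:
gunzip -c sample2.txt.gz | tail -n +$(( (k - 1) * N + 1 )) | head -n "$N"
# prints lines 9 to 12
```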

Solution 4

Directly split .gz file to .gz files:

zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'

I think this is what the OP wanted, because they don't have much space.
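Note that --filter is a GNU split extension, and $FILE must stay single-quoted so that split itself, not the shell, expands it. A small demo (the part_ prefix is arbitrary, and a tiny sample file stands in for the real one):

```shell
seq 1 10 | gzip > bigdemo.gz       # tiny stand-in for bigfile.gz
# Each 4-line chunk is piped through gzip; $FILE is expanded by split.
gunzip -c bigdemo.gz | split -l 4 -d --filter='gzip > $FILE.gz' - part_
ls part_*    # part_00.gz part_01.gz part_02.gz
```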

Solution 5

I'd consider using split.

split a file into pieces


Author: Rajeshkumar

Updated on September 18, 2022

Comments

  • Rajeshkumar
    Rajeshkumar over 1 year

    I'm sure someone has had the below need: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into 3 × 40 mn line files, i.e. calling it like:

        bash splitter.sh hugefile.txt.gz 40000000 1
    would get lines 1 to 40 mn
        bash splitter.sh hugefile.txt.gz 40000000 2
    would get lines 40 mn to 80 mn
        bash splitter.sh hugefile.txt.gz 40000000 3
    would get lines 80 mn to 120 mn
    

    Is perhaps doing a series of these a solution, or would gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)? gunzip -c hugefile.txt.gz | head -n 4000000

    Note: I can't get extra disk.

    Thanks!

    • Admin
      Admin over 12 years
      Do you want the resulting files to be gzipped again?
    • Ingo
      Ingo over 12 years
      You can use gunzip in a pipe. The rest can be done with head and tail.
    • Rajeshkumar
      Rajeshkumar over 12 years
      @Tichodroma - no, I don't need them gzipped again. But I could not store all the split text files at once. So I would like to get the first split, do stuff with it, then delete the first split and get the 2nd split, etc., finally removing the original .gz.
    • sleske
      sleske over 12 years
      @toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it.
    • b0fh
      b0fh over 9 years
      The accepted answer is good if you only want a fraction of the chunks and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster: O(N) instead of O(N²).
  • Alois Mahdal
    Alois Mahdal about 12 years
    From a performance view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 4 mn lines are needed?
  • sleske
    sleske about 12 years
    @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.
  • sleske
    sleske about 12 years
    But if you are interested, you should really ask this as a separate question.
  • b0fh
    b0fh over 9 years
    This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.
  • sleske
    sleske over 7 years
    @b0fh: Yes, you are right. Upvoted, and referenced in my answer :-).
  • Stephen Blum
    Stephen Blum about 6 years
    Best answer for sure.
  • Quetzalcoatl
    Quetzalcoatl over 5 years
    what are the output specs so that the outputs are .gz files themselves?
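sleske's SIGPIPE point in the comments above is easy to demonstrate: yes would run forever, but it is terminated as soon as head closes the pipe:

```shell
# `yes` prints "y" endlessly; head takes 3 lines and exits, which
# terminates `yes` via SIGPIPE instead of running forever.
yes | head -n 3    # prints y three times
```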