unix - split a huge .gz file by line
Solution 1
How to do this best depends on what you want:
- Do you want to extract a single part of the large file?
- Or do you want to create all the parts in one go?
If you want a single part of the file, your idea to use gunzip and head is right. You can use:
gunzip -c hugefile.txt.gz | head -n 40000000
That would output the first 40000000 lines on standard output - you probably want to append another pipe to actually do something with the data.
To get the other parts, you'd use a combination of head and tail, like:
gunzip -c hugefile.txt.gz | head -n 80000000 | tail -n 40000000
to get the second block.
Is perhaps doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped?
No, gunzip -c does not require any disk space - it decompresses the stream in memory, a buffer at a time, and writes the result to stdout.
If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
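The two pipelines above can be wrapped into the splitter.sh interface the question asks for. A minimal sketch, assuming standard gunzip, tail and head (the extract_chunk name is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper matching the question's interface:
#   extract_chunk <file.gz> <lines_per_chunk> <chunk_number>
# Streams chunk N (1-based) of the decompressed file to stdout,
# without ever storing the whole uncompressed file on disk.
extract_chunk() {
    file=$1; lines=$2; n=$3
    start=$(( (n - 1) * lines + 1 ))   # first line of chunk N
    gunzip -c "$file" | tail -n +"$start" | head -n "$lines"
}
```

For example, extract_chunk hugefile.txt.gz 40000000 2 would stream lines 40000001 to 80000000, which you can pipe onward or redirect to a file and delete when done.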
Solution 2
Pipe to split, using either gunzip -c or zcat to open the file:
gunzip -c bigfile.gz | split -l 400000
Add output specifications to the split command.
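For example, assuming GNU split, the -d and -a options plus a prefix give the pieces predictable names (chunk_ is just an illustrative prefix, with a small generated file standing in for bigfile.gz):

```shell
# Small stand-in for bigfile.gz, so the example is self-contained:
seq 1 1000000 | gzip > bigfile.gz
# -d: numeric suffixes, -a 2: two suffix digits, '-': read stdin,
# 'chunk_': output prefix -> pieces named chunk_00, chunk_01, chunk_02
gunzip -c bigfile.gz | split -l 400000 -d -a 2 - chunk_
```

Each piece holds 400000 lines, except the last, which gets the remainder.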
Solution 3
As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.
zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
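Those three pipelines can be generated in a loop. A minimal sketch (the split_all_chunks name and chunk_N.txt output names are hypothetical); note that every iteration re-reads and re-decompresses the stream from the start, so split is faster if you want all chunks anyway:

```shell
#!/bin/sh
# Hypothetical helper: write every chunk of $2 lines from gzip file $1
# into chunk_1.txt, chunk_2.txt, ...
# Each chunk re-reads the whole stream, so this is O(N^2) overall.
split_all_chunks() {
    file=$1; lines=$2
    total=$(zcat "$file" | wc -l)   # one full pass just to count lines
    i=1; start=1
    while [ "$start" -le "$total" ]; do
        zcat "$file" | tail -n +"$start" | head -n "$lines" > "chunk_$i.txt"
        start=$(( start + lines ))
        i=$(( i + 1 ))
    done
}
```

This fits the OP's workflow of processing and deleting one chunk at a time: replace the redirection with whatever per-chunk processing is needed.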
Solution 4
Directly split .gz file to .gz files:
zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'
I think this is what the OP wanted, because he doesn't have much space.
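Combining this with naming options (assuming GNU split, which provides --filter, -d and -a) gives the compressed pieces predictable names. Note that $FILE is expanded by split itself, not the shell, so the filter must be single-quoted; a small generated file stands in for bigfile.gz here:

```shell
# Small stand-in for bigfile.gz, so the example is self-contained:
seq 1 1000000 | gzip > bigfile.gz
# Each 400000-line piece is re-compressed by the filter and named
# chunk_00.gz, chunk_01.gz, ... ($FILE is set by split for each piece).
zcat bigfile.gz | split -l 400000 -d -a 2 --filter='gzip > $FILE.gz' - chunk_
```

At no point does more than one uncompressed piece exist, which is what makes this work under the question's disk-space constraint.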
Solution 5
I'd consider using split.
split a file into pieces
Rajeshkumar
Updated on September 18, 2022
Comments
-
Rajeshkumar over 1 year
I'm sure someone has had the need below: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into 3x 40 mn line files, i.e. calling it like:
bash splitter.sh hugefile.txt.gz 40000000 1 would get lines 1 to 40 mn
bash splitter.sh hugefile.txt.gz 40000000 2 would get lines 40 mn to 80 mn
bash splitter.sh hugefile.txt.gz 40000000 3 would get lines 80 mn to 120 mn
Is perhaps doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem): gunzip -c hugefile.txt.gz | head -n 40000000
Note: I can't get extra disk.
Thanks!
-
Admin over 12 years
Do you want the resulting files to be gzipped again?
-
Ingo over 12 years
You can use gunzip in a pipe. The rest can be done with head and tail.
-
Rajeshkumar over 12 years
@Tichodroma - no, I don't need them gzipped again. But I could not store all the split text files at once. So I would like to get the first split, do stuff with it, then delete the first split, then get the 2nd split, etc., finally removing the original gz.
-
sleske over 12 years
@toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it.
-
b0fh over 9 years
The accepted answer is good if you only want a fraction of the chunks, and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster, O(N) instead of O(N²).
-
Alois Mahdal about 12 years
From a performance view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 40 mn lines are needed?
-
sleske about 12 years
@AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.
-
sleske about 12 years
But if you are interested, you should really ask this as a separate question.
-
b0fh over 9 years
This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.
-
sleske over 7 years
@b0fh: Yes, you are right. Upvoted, and referenced in my answer :-).
-
Stephen Blum about 6 years
Best answer for sure.
-
Quetzalcoatl over 5 years
What are the output specs so that the outputs are .gz files themselves?