cat line X to line Y on a huge file
Solution 1
I suggest the sed solution, but for the sake of completeness:
awk 'NR >= 57890000 && NR <= 57890010' /path/to/file
To stop reading once the last wanted line has been printed:
awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
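The early-exit variant can be wrapped in a small helper so the line numbers are not hard-coded. This is only a sketch: the function name awk_lines and its argument order are assumptions made for illustration, not part of the original answer.

```shell
#!/bin/sh
# awk_lines X Y FILE: print lines X..Y (inclusive, 1-based) of FILE.
# Hypothetical helper; x and y are passed to awk with -v.
awk_lines() {
    awk -v x="$1" -v y="$2" '
        NR < x  { next }   # skip everything before line X
                { print }  # print lines from X onwards
        NR == y { exit }   # stop reading once line Y has been printed
    ' "$3"
}

# Small demonstration on generated data.
tmp=$(mktemp)
seq 100 > "$tmp"
awk_lines 57 60 "$tmp"
rm -f "$tmp"
```

The exit rule is what keeps awk from scanning the rest of a huge file after line Y.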
Speed test (here on macOS, YMMV on other systems):
- 100,000,000-line file generated by
seq 100000000 > test.in
- Reading lines 50,000,000-50,000,010
- Tests in no particular order
- real time as reported by bash's builtin time
4.373 4.418 4.395 tail -n+50000000 test.in | head -n10
5.210 5.179 6.181 sed -n '50000000,50000010p;57890010q' test.in
5.525 5.475 5.488 head -n50000010 test.in | tail -n10
8.497 8.352 8.438 sed -n '50000000,50000010p' test.in
22.826 23.154 23.195 tail -n50000001 test.in | head -n10
25.694 25.908 27.638 ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574 awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127 awk 'NR >= 57890000 && NR <= 57890010' test.in
These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.
*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.
Solution 2
If you want lines X to Y inclusive (starting the numbering at 1), use
tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"
tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
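As a minimal sketch, the tail | head pipeline can be wrapped in a function that takes X, Y and the file. The name print_lines and the argument order are assumptions for illustration only.

```shell
#!/bin/sh
# print_lines X Y FILE: print lines X..Y (inclusive, 1-based) of FILE.
# Hypothetical helper around the tail | head pipeline.
print_lines() {
    x=$1 y=$2 file=$3
    # tail discards the first X-1 lines; head exits after Y-X+1 lines,
    # which in turn stops tail through SIGPIPE.
    tail -n "+$x" < "$file" | head -n "$((y - x + 1))"
}

# Small demonstration on generated data.
tmp=$(mktemp)
seq 100 > "$tmp"
print_lines 57 60 "$tmp"
rm -f "$tmp"
```

The only arithmetic needed is Y-X+1, which the shell computes itself.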
Alternatively, as gorkypl suggested, use sed:
sed -n -e "$X,$Y p" -e "$Y q" /path/to/file
The sed solution is significantly slower though (at least for GNU utilities and Busybox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux; the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs and the machine is otherwise idle and not swapping.
real user sys command
0.47 0.32 0.12 </tmp/a tail -n +50000001 | head -n 10 #GNU
0.86 0.64 0.21 </tmp/a tail -n +50000001 | head -n 10 #BusyBox
3.57 3.41 0.14 sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68 0.14 sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
1.04 0.60 0.46 </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
7.12 6.58 0.55 </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
9.95 9.54 0.28 sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13 0.31 sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox
If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x inclusive to y exclusive starting at 0, with a block size of b:
dd bs="$b" skip="$x" count="$((y-x))" </path/to/file
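A small sketch of the block-range idea on a toy file. It relies on dd's skip= operand, which positions the input (seek= positions the output instead). The function name extract_blocks and its argument order are assumptions for illustration.

```shell
#!/bin/sh
# extract_blocks B X Y FILE: copy blocks X (inclusive) to Y (exclusive),
# counted from 0, with a block size of B bytes. Hypothetical helper.
extract_blocks() {
    b=$1 x=$2 y=$3 file=$4
    # skip= jumps over X input blocks; count= limits the copy to Y-X blocks.
    # dd's transfer statistics go to stderr and are discarded here.
    dd bs="$b" skip="$x" count="$((y - x))" < "$file" 2>/dev/null
}

# Toy input: four 4-byte blocks.
tmp=$(mktemp)
printf 'aaaabbbbccccdddd' > "$tmp"
extract_blocks 4 1 3 "$tmp"   # prints blocks 1 and 2: bbbbcccc
echo
rm -f "$tmp"
```

Because the input is a regular file, dd can jump to the start offset instead of reading everything before it, which is what makes byte ranges cheaper than line ranges.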
Solution 3
The head | tail approach is one of the best and most "idiomatic" ways to do this:
X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"
As pointed out by Gilles in the comments, a faster way is
< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"
The reason this is faster is that the first X - 1 lines don't need to go through the pipe, unlike in the head | tail approach.
Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach.
- You say you have to calculate A, B, C, D, but as you can see, the line count of the file is not needed and at most one calculation is necessary, which the shell can do for you anyway.
- You worry that piping will read more lines than necessary. In fact this is not true: tail | head is about as efficient as you can get in terms of file I/O. First, consider the minimum amount of work necessary: to find the Xth line in a file, the only general way to do it is to read every byte and stop when you count X newline symbols, as there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you have to read all the lines in order to print them, stopping at the Yth line. Thus no approach can get away with reading fewer than Y lines. Now, head -n $Y reads no more than Y lines (rounded to the nearest buffer unit, but buffers, if used correctly, improve performance, so there is no need to worry about that overhead). In addition, tail will not read any more than head, so we have shown that head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a single-tool approach that does not use pipes is fewer processes (and thus less overhead).
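To check that the two pipeline orders select the same lines, here is a small sketch on generated data. The function names head_tail and tail_head are hypothetical, and the tail-first variant assumes a count of Y - X + 1 so that the range is inclusive at both ends.

```shell
#!/bin/sh
# Both helpers print lines X..Y (inclusive, 1-based) of FILE.
head_tail() {
    # head passes the first Y lines through the pipe; tail drops X-1 of them.
    head -n "$2" < "$3" | tail -n "+$1"
}
tail_head() {
    # tail drops the first X-1 lines itself; only the rest enters the pipe.
    tail -n "+$1" < "$3" | head -n "$(($2 - $1 + 1))"
}

tmp=$(mktemp)
seq 1000 > "$tmp"
if [ "$(head_tail 400 410 "$tmp")" = "$(tail_head 400 410 "$tmp")" ]; then
    echo "same output"
fi
rm -f "$tmp"
```

Both read the file from the beginning; the difference is only how many lines have to travel through the pipe.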
Solution 4
The most orthodox way (but not the fastest, as noted by Gilles above) would be to use sed.
In your case:
X=57890000
Y=57890010
sed -n -e "$X,$Y p" -e "$Y q" filename
The -n option suppresses sed's automatic printing, so only the requested lines are written to stdout.
The p after the X,Y address range prints the lines in that range. The q in the second part of the script saves some time by quitting at line Y instead of reading the remainder of the file.
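The same sed invocation can be tried on a small generated file to see both parts of the script at work. This is a sketch only; the wrapper name sed_lines is an assumption for illustration.

```shell
#!/bin/sh
# sed_lines X Y FILE: print lines X..Y (inclusive, 1-based) of FILE.
# Hypothetical wrapper around the sed command shown above.
sed_lines() {
    # -n        : no automatic printing
    # "X,Y p"   : print only the lines in the range
    # "Y q"     : quit at line Y so the rest of the file is not read
    sed -n -e "$1,$2 p" -e "$2 q" "$3"
}

tmp=$(mktemp)
seq 100 > "$tmp"
sed_lines 57 60 "$tmp"
rm -f "$tmp"
```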
Solution 5
I do this often enough and so wrote this script. I don't need to find the line numbers, the script does it all.
#!/bin/bash
# $1: start time
# $2: end time
# $3: log file to read
# $4: output file
# e.g. log_slice.sh 18:33 19:40 /var/log/my.log /var/log/myslice.log
if [[ $# -ne 4 ]] ; then
    echo 'usage: log_slice.sh <start time> <end time> <log file> <output file>'
    echo
    exit 1
fi
if [ ! -f "$3" ] ; then
    echo "'$3' doesn't seem to exist."
    echo 'exiting.'
    exit 1
fi
# line number of the first occurrence of the start time
sline=$(grep -n " ${1}" "$3" | head -n 1 | cut -d: -f1)
# line number of the first occurrence of the end time
eline=$(grep -n " ${2}" "$3" | head -n 1 | cut -d: -f1)
linediff="$((eline-sline))"
# print from the start line up to (but not including) the end line
tail -n "+${sline}" "$3" | head -n "$linediff" > "$4"
Amelio Vazquez-Reina
Updated on September 18, 2022
Comments
-
Amelio Vazquez-Reina almost 2 years
Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).
From what I understand I can do this by piping head into tail or vice versa, i.e.
head -A /path/to/file | tail -B
or alternatively
tail -C /path/to/file | head -D
where A, B, C and D can be computed from the number of lines in the file, X and Y.
But there are two problems with this approach:
- You have to compute A, B, C and D.
- The commands could pipe to each other many more lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file)
Is there a way to have the shell just work with and output the lines I want (while providing only X and Y)?
-
Admin almost 12 years
FYI, actual speed test comparison of 6 methods added to my answer.
-
Admin over 4 years
You can consider using the split command too!
-
Paweł Rumian almost 12 years
Out of curiosity: how have you flushed the disk cache between tests?
-
Gilles 'SO- stop being evil' almost 12 years
What about tail -n +50000000 test.in | head -n10, which unlike tail -n-50000000 test.in | head -n10 would give the correct result?
-
Gilles 'SO- stop being evil' almost 12 years
Ok, I went and did some benchmarks. tail|head is way faster than sed, the difference is a lot more than I expected.
-
Gilles 'SO- stop being evil' almost 12 years
I expected sed and tail | head to be about on par, but it turns out that tail | head is significantly faster (see my answer).
-
Paweł Rumian almost 12 years
Are you sure that there is no caching in between? The differences between tail|head and sed seem too big to me.
-
Gilles 'SO- stop being evil' almost 12 years
@gorkypl I did several measures and the times were comparable. As I wrote, this is all happening in RAM (everything is in the cache).
-
Kevin almost 12 years
@Gilles you're right, my bad. tail+|head is faster by 10-15% than sed, I've added that benchmark.
-
Kevin almost 12 years
@gorkypl I didn't, I figure it should be cached for all of them, certainly by the third iteration.
-
Gilles 'SO- stop being evil' almost 12 years
I wonder why I'm seeing sed being so much slower than tail|head and you're seeing only a small difference (which is what I'd expected before doing the benchmark). I did my tests on Debian stable.
-
Kevin almost 12 years
@Gilles I'm on a mac, so BSD utils, but I'd be surprised if the difference between BSD and GNU head and tail is really so big.
-
vonbrand over 11 years
Humm, tail -N has to stash N lines somewhere to be able to get the last ones out at the end. Not nice on your memory. I'd vote for sed (or awk in a pinch, but the whole "split into fields" stuff adds overhead). Just guessing.
-
erik over 10 years
Edited the test: Ran on /dev/shm to not use the disk cache but run the whole stuff in memory.
-
Danny Kirchmeier about 10 years
I realize that the question asks for lines, but if you use the -c to skip characters, tail+|head is instantaneous. Of course, you can't say "50000000" and may have to manually search out the start of the section you're looking for.
-
G-Man Says 'Reinstate Monica' over 9 years
You're answering a question that wasn't asked. Your answer is 10% tail|head, which has been discussed extensively in the question and the other answers, and 90% determining the line numbers where specified strings/patterns appear, which wasn't part of the question. P.S. you should always quote your shell parameters and variables; e.g., "$3" and "$4".
-
Soheil over 9 years
Don't forget to add {print; exit} if you want awk to return, i.e.: awk "NR==10{print; exit}" file
-
mitchus almost 9 years
Hey Guys!!. There are 11 elements from 5...00 to 5...10. So, all head -n10 should be corrected to head -n11. Am I missing something here?
-
Gilles 'SO- stop being evil' almost 9 years
@BinaryZebra Yes, if the input is a regular file, some implementations of tail (including GNU tail) have heuristics to read from the end. That improves the tail | head solution compared to other methods.
-
Admin almost 9 years
Comments are not for extended discussion; this conversation has been moved to chat.
-
Admin almost 9 years
@BinaryZebra - way better.
-
clacke about 8 years
Never seen the redirection go first on the line before. Cool, it makes the pipe flow clearer.
-
underscore_d over 7 years
You ran some of your commands relative to line 50000000 but others to line 57890000. Unfortunately, this inconsistency isn't confined to independent commands: your 2nd command mixes the 2 unrelated bases:
sed -n '50000000,50000010p;57890010q' test.in
If that's what you actually ran, it would produce an artificially crippled benchmark - by wasting time reading and discarding 7890000 lines that it doesn't need. Or did you run the right command but just transcribe it wrongly here?
-
underscore_d over 7 years
I dunno, from what I've read, tail/head are considered more "orthodox", since trimming either end of a file is precisely what they're made for. In those materials, sed only seems to enter the picture when substitutions are required - and to quickly be pushed out of the picture when anything much more complex starts to happen, since its syntax for complex tasks is so much worse than AWK, which then takes over.
-
Kevin over 7 years
@underscore_d it was 4 years ago, who knows what I ran. If you want to check, you can run the benchmarks yourself.
-
Rodrigo over 7 years
You haven't specified how to pass the filename as parameter
-
Nalous Nalous about 4 years
tail -n+49000000 test.in | head -n10 takes 2 seconds! Beats anything else on dual core + SSD (Linux blade 4.10.0-42-generic #46~16.04.1-Ubuntu) head -n500200100 zzz | tail -n10 (7 seconds); head -n999200100 zzz | tail -n10 (12 seconds)
-
Scott - Слава Україні almost 4 years
Pretty much every other answer has at least mentioned, if not recommended, tail piped into head. Adding a cat to the pipeline just adds noise and overhead, but doesn’t add value. You might as well say “If you’re wearing gloves when you type the command, you will want to use tail first and then head.” It doesn’t matter.
-
Scott - Слава Україні almost 4 years
You say “This is the solution that worked for me.” [emphasis added] I say it is a solution that works. Did you try tail -n +3 file.name | head -n -1? What happened?