cat line X to line Y on a huge file

Solution 1

I suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To stop reading the file after the last wanted line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
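
If X and Y live in shell variables, a minimal variant of the same idea (a sketch using awk's standard -v option; the variable names are just placeholders) is:

X=57890000
Y=57890010
awk -v x="$X" -v y="$Y" 'NR >= x && NR <= y { print } NR == y { exit }' /path/to/file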

Speed test (here on macOS, YMMV on other systems):

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time
 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;57890010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 57890000 && NR <= 57890010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between sed -n 'p;q' and head|tail, which seem to be essentially the same.

Solution 2

If you want lines X to Y inclusive (starting the numbering at 1), use

tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
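
Concretely, for the range in the question (lines 57890000 to 57890010, i.e. 57890010 - 57890000 + 1 = 11 lines), this becomes:

tail -n +57890000 /path/to/file | head -n 11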

Alternatively, as gorkypl suggested, use sed:

sed -n -e "$X,$Y p" -e "$Y q" /path/to/file

The sed solution is significantly slower though (at least for GNU and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks: the data was generated by seq 100000000 >/tmp/a, the machine is Linux/amd64 with /tmp on tmpfs, and it was otherwise idle and not swapping.

real  user  sys    command
 0.47  0.32  0.12  </tmp/a tail -n +50000001 | head -n 10 #GNU
 0.86  0.64  0.21  </tmp/a tail -n +50000001 | head -n 10 #BusyBox
 3.57  3.41  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
 1.04  0.60  0.46  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
 7.12  6.58  0.55  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
 9.95  9.54  0.28  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13  0.31  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox

If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x inclusive to y exclusive starting at 0, with a block size of b:

dd bs="$b" skip="$x" count="$((y-x))" </path/to/file
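
As a made-up example, with 32 KiB blocks, this extracts the second mebibyte of a file (blocks 32 to 63, i.e. x=32, y=64, b=32k):

dd bs=32k skip=32 count=32 </path/to/file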

Solution 3

The head | tail approach is one of the best and most "idiomatic" ways to do this:

X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"

As pointed out by Gilles in the comments, a faster way is

< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"

The reason this is faster is that the first X - 1 lines never have to travel through the pipe, unlike in the head | tail approach.

Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach.

  • You say you have to calculate A, B, C, and D, but as you can see, the line count of the file is not needed, and at most one calculation is necessary, which the shell can do for you anyway.

  • You worry that piping will read more lines than necessary. In fact this is not true: tail | head is about as efficient as you can get in terms of file I/O. First, consider the minimum amount of work necessary: to find the Xth line in a file, the only general way is to read every byte and stop after counting X - 1 newline characters, since there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you have to read all the lines up to the Yth in order to print them, so no approach can get away with reading fewer than Y lines. Now, head -n $Y reads no more than Y lines (rounded up to the nearest buffer, but buffering, used correctly, improves performance, so that overhead is nothing to worry about). In addition, tail never reads more than head does, so head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a single-tool approach that avoids pipes is fewer processes, and thus less overhead. A toy check follows below.
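
A quick way to sanity-check that behaviour on a toy file (small.txt is just a scratch name used for this demonstration):

seq 100 > small.txt
tail -n +5 small.txt | head -n 3    # prints lines 5, 6 and 7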

Solution 4

The most orthodox way (but not the fastest, as noted by Gilles above) would be to use sed.

In your case:

X=57890000
Y=57890010
sed -n -e "$X,$Y p" -e "$Y q" filename

The -n option suppresses the default output, so only the explicitly selected lines are printed to stdout.

The p after the line range prints the lines in that range. The q in the second expression saves some time by quitting instead of reading the remainder of the file.
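
As a small illustration with made-up line numbers, this prints lines 5 through 7 of filename and then quits:

sed -n -e '5,7 p' -e '7 q' filename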

Solution 5

I do this often enough that I wrote a script for it. I don't need to find the line numbers myself; the script does it all.

#!/bin/bash

# $1: start time
# $2: end time
# $3: log file to read
# $4: output file

# e.g. log_slice.sh 18:33 19:40 /var/log/my.log /var/log/myslice.log

if [[ $# != 4 ]] ; then
    echo 'usage: log_slice.sh <start time> <end time> <log file> <output file>'
    exit 1
fi

if [ ! -f "$3" ] ; then
    echo "'$3' doesn't seem to exist."
    echo 'exiting.'
    exit 1
fi

# line number of the first occurrence of the start time
sline=$(grep -n " ${1}" "$3" | head -n 1 | cut -d: -f1)
# line number of the first occurrence of the end time
eline=$(grep -n " ${2}" "$3" | head -n 1 | cut -d: -f1)

linediff=$((eline - sline))

tail -n +"$sline" "$3" | head -n "$linediff" > "$4"

Comments

  • Amelio Vazquez-Reina
    Amelio Vazquez-Reina almost 2 years

    Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).

    From what I understand I can do this by piping head into tail or vice versa, i.e.

    head -A /path/to/file | tail -B
    

    or alternatively

    tail -C /path/to/file | head -D
    

    where A, B, C and D can be computed from the number of lines in the file, X and Y.

    But there are two problems with this approach:

    1. You have to compute A, B, C and D.
    2. The commands could pipe to each other many more lines than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file)

    Is there a way to have the shell just work with and output the lines I want (while providing only X and Y)?

  • Paweł Rumian
    Paweł Rumian almost 12 years
    Out of curiosity: how have you flushed the disk cache between tests?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 12 years
    What about tail -n +50000000 test.in | head -n10, which unlike tail -n-50000000 test.in | head -n10 would give the correct result?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 12 years
    Ok, I went and did some benchmarks. tail|head is way faster than sed, the difference is a lot more than I expected.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 12 years
    I expected sed and tail | head to be about on par, but it turns out that tail | head is significantly faster (see my answer).
  • Paweł Rumian
    Paweł Rumian almost 12 years
    Are you sure that there is no caching inbetween? The differences between tail|head and sed seem too big to me.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 12 years
    @gorkypl I did several measures and the times were comparable. As I wrote, this is all happening in RAM (everything is in the cache).
  • Kevin
    Kevin almost 12 years
    @Gilles you're right, my bad. tail+|head is faster by 10-15% than sed, I've added that benchmark.
  • Kevin
    Kevin almost 12 years
    @gorkypl I didn't, I figure it should be cached for all of them, certainly by the third iteration.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 12 years
    I wonder why I'm seeing sed being so much slower than tail|head and you're seeing only a small difference (which is what I'd expected before doing the benchmark). I did my tests on Debian stable.
  • Kevin
    Kevin almost 12 years
    @Gilles I'm on a mac, so BSD utils, but I'd be surprised if the differences between BSD and GNU head and tail are really so big.
  • vonbrand
    vonbrand over 11 years
    Humm, tail -N has to stash N lines somewhere to be able to get the last ones out at the end. Not nice on your memory. I'd vote for sed (or awk in a pinch, but the whole "split into fields" stuff adds overhead). Just guessing.
  • erik
    erik over 10 years
    Edited the test: Ran on /dev/shm to not use the disk cache but run the whole stuff in memory.
  • Danny Kirchmeier
    Danny Kirchmeier about 10 years
    I realize that the question asks for lines, but if you use the -c option to skip characters, tail+|head is instantaneous. Of course, you can't say "50000000" and may have to manually search out the start of the section you're looking for.
  • G-Man Says 'Reinstate Monica'
    G-Man Says 'Reinstate Monica' over 9 years
    You're answering a question that wasn't asked. Your answer is 10% tail|head, which has been discussed extensively in the question and the other answers, and 90% determining the line numbers where specified strings/patterns appear, which wasn't part of the question. P.S. you should always quote your shell parameters and variables; e.g., "$3" and "$4".
  • Soheil
    Soheil over 9 years
    Don't forget to add {print; exit} if you want awk to return, i.e.: awk "NR==10{print; exit}" file
  • mitchus
    mitchus almost 9 years
    Hey guys! There are 11 elements from 5...00 to 5...10, so all head -n10 should be corrected to head -n11. Am I missing something here?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 9 years
    @BinaryZebra Yes, if the input is a regular file, some implementations of tail (including GNU tail) have heuristics to read from the end. That improves the tail | head solution compared to other methods.
  • Admin
    Admin almost 9 years
    @BinaryZebra - way better.
  • clacke
    clacke about 8 years
    Never seen the redirection go first on the line before. Cool, it makes the pipe flow clearer.
  • underscore_d
    underscore_d over 7 years
    You ran some of your commands relative to line 50000000 but others to line 57890000. Unfortunately, this inconsistency isn't confined to independent commands: your 2nd command mixes the 2 unrelated bases: sed -n '50000000,50000010p;57890010q' test.in If that's what you actually ran, it would produce an artificially crippled benchmark - by wasting time reading and discarding 7890000 lines that it doesn't need. Or did you run the right command but just transcribe it wrongly here?
  • underscore_d
    underscore_d over 7 years
    I dunno, from what I've read, tail/head are considered more "orthodox", since trimming either end of a file is precisely what they're made for. In those materials, sed only seems to enter the picture when substitutions are required - and to quickly be pushed out of the picture when anything much more complex starts to happen, since its syntax for complex tasks is so much worse than AWK, which then takes over.
  • Kevin
    Kevin over 7 years
    @underscore_d it was 4 years ago, who knows what I ran. If you want to check, you can run the benchmarks yourself.
  • Rodrigo
    Rodrigo over 7 years
    You haven't specified how to pass the filename as a parameter.
  • Nalous Nalous
    Nalous Nalous about 4 years
    tail -n+49000000 test.in | head -n10 takes 2 seconds! Beats anything else on dual core + SSD (Linux blade 4.10.0-42-generic #46~16.04.1-Ubuntu) head -n500200100 zzz | tail -n10 (7 seconds); head -n999200100 zzz | tail -n10 (12 seconds)
  • Scott - Слава Україні
    Scott - Слава Україні almost 4 years
    Pretty much every other answer has at least mentioned, if not recommended, tail piped into head.  Adding a cat to the pipeline just adds noise and overhead, but doesn’t add value.  You might as well say “If you’re wearing gloves when you type the command, you will want to use tail first and then head.” — it doesn’t matter.
  • Scott - Слава Україні
    Scott - Слава Україні almost 4 years
    You say “This is the solution that worked for me.” [emphasis added] I say it is a solution that works.  Did you try tail -n +3 file.name | head -n -1?  What happened?