Bash tool to get nth line from a file
Solution 1
Piping head into tail will be slow for a huge file. I would suggest sed like this:
sed 'NUMq;d' file
Where NUM is the number of the line you want to print; so, for example, sed '10q;d' file prints the 10th line of file.
Explanation:
NUMq will quit immediately when the line number is NUM.
d will delete the line instead of printing it; this is inhibited on the last line because the q causes the rest of the script to be skipped when quitting.
If you have NUM in a variable, you will want to use double quotes instead of single:
sed "${NUM}q;d" file
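As a minimal sketch, the variable form can be wrapped in a small function. The name nth_line, the input check, and the throwaway demo file are illustrative assumptions, not part of the original answer:

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption): print line $1 of file $2.
nth_line() {
    local n=$1 file=$2
    # Reject anything that is not a positive integer before handing it to sed.
    case $n in
        ''|*[!0-9]*|0)
            echo "nth_line: line number must be a positive integer" >&2
            return 1 ;;
    esac
    # Double quotes let the shell expand ${n}; sed then sees e.g. 3q;d
    sed "${n}q;d" "$file"
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
printf '%s\n' alpha beta gamma delta > "$demo"
nth_line 3 "$demo"    # prints the third line, gamma
```

If the requested line number is larger than the file, sed reaches end of input without printing anything, so the function prints nothing.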
Solution 2
sed -n '2p' < file.txt
will print 2nd line
sed -n '2011p' < file.txt
2011th line
sed -n '10,33p' < file.txt
line 10 up to line 33
sed -n '1p;3p' < file.txt
1st and 3rd line
and so on...
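Note that the commands above keep reading until the end of the file after the wanted lines have been printed. A sketch of a range variant that quits at the last requested line follows; the function name line_range and the demo file are assumptions made for illustration:

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption): print lines $1..$2 of
# file $3, quitting at the last requested line instead of scanning to EOF.
line_range() {
    local first=$1 last=$2 file=$3
    # -n suppresses automatic printing; p prints the lines in the range,
    # and q stops sed once line $last has been printed.
    sed -n "${first},${last}p;${last}q" "$file"
}

demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
seq 1 100 > "$demo"
line_range 10 13 "$demo"   # prints 10, 11, 12 and 13 on separate lines
```

For small files the early quit makes no visible difference; it only matters when the range sits near the start of a large file.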
For adding lines with sed, you can check this:
sed: insert a line in a certain position
Solution 3
I have a unique situation where I can benchmark the solutions proposed on this page, and so I'm writing this answer as a consolidation of the proposed solutions with included run times for each.
Set Up
I have a 3.261 gigabyte ASCII text data file with one key-value pair per row. The file contains 3,339,550,320 rows in total and defies opening in any editor I have tried, including my go-to Vim. I need to subset this file in order to investigate some of the values that I've discovered only start around row ~500,000,000.
Because the file has so many rows:
- I need to extract only a subset of the rows to do anything useful with the data.
- Reading through every row leading up to the values I care about is going to take a long time.
- If the solution reads past the rows I care about and continues reading the rest of the file it will waste time reading almost 3 billion irrelevant rows and take 6x longer than necessary.
My best-case scenario is a solution that extracts only a single line from the file without reading any of the other rows in the file, but I can't think of how I would accomplish this in Bash.
For the purposes of my sanity I'm not going to be trying to read the full 500,000,000 lines I'd need for my own problem. Instead I'll be trying to extract row 50,000,000 out of 3,339,550,320 (which means reading the full file will take 60x longer than necessary).
I will be using the time built-in to benchmark each command.
Baseline
First let's see how the head | tail solution performs:
$ time head -50000000 myfile.ascii | tail -1
pgm_icnt = 0
real 1m15.321s
The baseline for row 50 million is 00:01:15.321; if I'd gone straight for row 500 million it would probably have taken ~12.5 minutes.
cut
I'm dubious of this one, but it's worth a shot:
$ time cut -f50000000 -d$'\n' myfile.ascii
pgm_icnt = 0
real 5m12.156s
This one took 00:05:12.156 to run, which is much slower than the baseline! I'm not sure whether it read through the entire file or just up to line 50 million before stopping, but regardless this doesn't seem like a viable solution to the problem.
AWK
I only ran the version with exit, because I wasn't going to wait for awk to read the full file:
$ time awk 'NR == 50000000 {print; exit}' myfile.ascii
pgm_icnt = 0
real 1m16.583s
This code ran in 00:01:16.583, which is only ~1 second slower than the baseline, so it is not an improvement. At this rate, if the exit command had been excluded, it would probably have taken ~76 minutes to read the entire file!
Perl
I ran the existing Perl solution as well:
$ time perl -wnl -e '$.== 50000000 && print && exit;' myfile.ascii
pgm_icnt = 0
real 1m13.146s
This code ran in 00:01:13.146, which is ~2 seconds faster than the baseline. If I'd run it on the full 500,000,000 it would probably take ~12 minutes.
sed
This is the top answer on the board. Here's my result:
$ time sed "50000000q;d" myfile.ascii
pgm_icnt = 0
real 1m12.705s
This code ran in 00:01:12.705, which is ~2.6 seconds faster than the baseline, and ~0.4 seconds faster than Perl. If I'd run it on the full 500,000,000 rows it would probably have taken ~12 minutes.
mapfile
I have bash 3.1 and therefore cannot test the mapfile solution.
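For readers on bash 4 or later, here is a sketch of what a mapfile-based approach might look like. It is an assumption about the solution referred to above (which is not reproduced here), it was not part of the benchmark, and the function name nth_line_mapfile is made up for illustration:

```shell
#!/bin/bash
# Requires bash >= 4; mapfile is not available in bash 3.x.
# Hypothetical helper (the name is an assumption): print line $1 of file $2.
nth_line_mapfile() {
    local n=$1 file=$2 lines
    # -s skips the first n-1 lines, -n 1 reads exactly one line,
    # -t strips its trailing newline.
    mapfile -t -s "$((n - 1))" -n 1 lines < "$file"
    # If the file has fewer than n lines the array is empty: report failure.
    if [ "${#lines[@]}" -eq 0 ]; then
        return 1
    fi
    printf '%s\n' "${lines[0]}"
}

demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
printf '%s\n' alpha beta gamma delta > "$demo"
nth_line_mapfile 2 "$demo"   # prints beta
```

No timing claim is made for this variant; it would have to be measured on the same file to be comparable with the results here.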
Conclusion
It looks like, for the most part, it's difficult to improve upon the head | tail solution. At best the sed solution provides a ~3% increase in efficiency.
(Percentages calculated with the formula % = (runtime/baseline - 1) * 100)
Row 50,000,000
- 00:01:12.705 (-00:00:02.616 = -3.47%) sed
- 00:01:13.146 (-00:00:02.175 = -2.89%) perl
- 00:01:15.321 (+00:00:00.000 = +0.00%) head|tail
- 00:01:16.583 (+00:00:01.262 = +1.68%) awk
- 00:05:12.156 (+00:03:56.835 = +314.43%) cut
Row 500,000,000
- 00:12:07.050 (-00:00:26.160) sed
- 00:12:11.460 (-00:00:21.750) perl
- 00:12:33.210 (+00:00:00.000) head|tail
- 00:12:45.830 (+00:00:12.620) awk
- 00:52:01.560 (+00:39:28.350) cut
Row 3,338,559,320
- 01:20:54.599 (-00:02:54.674) sed
- 01:21:24.045 (-00:02:25.227) perl
- 01:23:49.273 (+00:00:00.000) head|tail
- 01:25:13.548 (+00:01:24.275) awk
- 05:47:23.026 (+04:23:33.753) cut
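The percentage formula used for the row 50,000,000 figures can be checked with a few lines of shell. This is only a sketch: the helper name pct_delta is an assumption, and the runtimes are the measured values converted to seconds:

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption): percentage difference of a
# runtime against a baseline, both in seconds, using
# % = (runtime/baseline - 1) * 100.
pct_delta() {
    awk -v runtime="$1" -v baseline="$2" \
        'BEGIN { printf "%+.2f\n", (runtime / baseline - 1) * 100 }'
}

# 00:01:12.705 is 72.705 s (sed); 00:01:15.321 is 75.321 s (head | tail).
pct_delta 72.705 75.321    # prints -3.47
# 00:05:12.156 is 312.156 s (cut).
pct_delta 312.156 75.321   # prints +314.43
```

A negative result means the command was faster than the head | tail baseline, a positive one that it was slower.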
Solution 4
With awk it is pretty fast:
awk 'NR == num_line' file
When this condition is true, the default behaviour of awk is performed: {print $0}.
Alternative versions
If your file happens to be huge, you'd better exit after reading the required line. This way you save CPU time. See the time comparison at the end of the answer.
awk 'NR == num_line {print; exit}' file
If you want to give the line number from a bash variable you can use:
awk 'NR == n' n="$num" file
awk -v n="$num" 'NR == n' file # equivalent
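Combining the variable form with the early exit gives something like the sketch below. The function name nth_line_awk and the demo file are assumptions for illustration only:

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption): print line $1 of file $2.
nth_line_awk() {
    local n=$1 file=$2
    # -v passes the shell value into awk as the variable n;
    # exit stops reading as soon as the line has been printed.
    awk -v n="$n" 'NR == n { print; exit }' "$file"
}

demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
seq 1 1000 > "$demo"
nth_line_awk 42 "$demo"   # prints 42
```

Passing the number with -v keeps it out of the awk program text, so the single-quoted script does not need to be rebuilt for each value.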
See how much time is saved by using exit, especially if the line happens to be in the first part of the file:
# Let's create a 10M lines file
for ((i=0; i<100000; i++)); do echo "bla bla"; done > 100Klines
for ((i=0; i<100; i++)); do cat 100Klines; done > 10Mlines
$ time awk 'NR == 1234567 {print}' 10Mlines
bla bla
real 0m1.303s
user 0m1.246s
sys 0m0.042s
$ time awk 'NR == 1234567 {print; exit}' 10Mlines
bla bla
real 0m0.198s
user 0m0.178s
sys 0m0.013s
So the difference is 0.198s vs 1.303s, around 6x faster.
Solution 5
According to my tests, in terms of performance and readability my recommendation is:
tail -n+N | head -1
N is the line number that you want. For example, tail -n+7 input.txt | head -1 will print the 7th line of the file.
tail -n+N will print everything starting from line N, and head -1 will make it stop after one line.
The alternative head -N | tail -1 is perhaps slightly more readable. For example, this will print the 7th line:
head -7 input.txt | tail -1
When it comes to performance, there is not much difference for smaller sizes, but it will be outperformed by the tail | head variant (from above) when the files become huge.
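A small sketch of the two pipelines side by side. The function names are assumptions, and the demo file is far too small to show any timing difference; it only illustrates that both return the same line, and how they differ when N is past the end of the file:

```shell
#!/bin/bash
# Hypothetical helpers (the names are assumptions) for the two pipelines.
nth_tail_head() { tail -n "+$1" "$2" | head -n 1; }
nth_head_tail() { head -n "$1" "$2" | tail -n 1; }

demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
seq 1 1000 > "$demo"

nth_tail_head 7 "$demo"      # prints 7
nth_head_tail 7 "$demo"      # prints 7

# Past the end of the file the two behave differently:
nth_tail_head 2000 "$demo"   # prints nothing
nth_head_tail 2000 "$demo"   # prints the last line, 1000
```

The second case is worth keeping in mind when N comes from user input: head | tail silently returns the last line instead of nothing.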
The top-voted sed 'NUMq;d' is interesting to know, but I would argue that it will be understood by fewer people out of the box than the head/tail solution, and it is also slower than tail/head.
In my tests, both tail/head versions outperformed sed 'NUMq;d' consistently. That is in line with the other benchmarks that were posted. It is hard to find a case where tail/head was really bad. It is also not surprising, as these are operations that you would expect to be heavily optimized in a modern Unix system.
To get an idea about the performance differences, these are the numbers that I get for a huge file (9.3G):
tail -n+N | head -1: 3.7 sec
head -N | tail -1: 4.6 sec
sed Nq;d: 18.8 sec
Results may differ, but the performance of head | tail and tail | head is, in general, comparable for smaller inputs, and sed is always slower by a significant factor (around 5x or so).
To reproduce my benchmark, you can try the following, but be warned that it will create a 9.3G file in the current working directory:
#!/bin/bash
# Benchmark three ways of printing line $pos of a generated file.
# The input file is regenerated before each section.
readonly file=tmp-input.txt
readonly size=1000000000
readonly pos=500000000
readonly retries=3
seq 1 "$size" > "$file"
echo "*** head -N | tail -1 ***"
for i in $(seq 1 "$retries") ; do
    time head "-$pos" "$file" | tail -1
done
echo "-------------------------"
echo
echo "*** tail -n+N | head -1 ***"
echo
seq 1 "$size" > "$file"
ls -alhg "$file"
for i in $(seq 1 "$retries") ; do
    time tail -n+"$pos" "$file" | head -1
done
echo "-------------------------"
echo
echo "*** sed Nq;d ***"
echo
seq 1 "$size" > "$file"
ls -alhg "$file"
for i in $(seq 1 "$retries") ; do
    time sed "${pos}q;d" "$file"
done
# Remove the generated input file.
/bin/rm "$file"
Here is the output of a run on my machine (ThinkPad X1 Carbon with an SSD and 16G of memory). I assume in the final run everything will come from the cache, not from disk:
*** head -N | tail -1 ***
500000000
real 0m9,800s
user 0m7,328s
sys 0m4,081s
500000000
real 0m4,231s
user 0m5,415s
sys 0m2,789s
500000000
real 0m4,636s
user 0m5,935s
sys 0m2,684s
-------------------------
*** tail -n+N | head -1 ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:49 tmp-input.txt
500000000
real 0m6,452s
user 0m3,367s
sys 0m1,498s
500000000
real 0m3,890s
user 0m2,921s
sys 0m0,952s
500000000
real 0m3,763s
user 0m3,004s
sys 0m0,760s
-------------------------
*** sed Nq;d ***
-rw-r--r-- 1 phil 9,3G Jan 19 19:50 tmp-input.txt
500000000
real 0m23,675s
user 0m21,557s
sys 0m1,523s
500000000
real 0m20,328s
user 0m18,971s
sys 0m1,308s
500000000
real 0m19,835s
user 0m18,830s
sys 0m1,004s
Vlad Vivdovitch
Updated on May 20, 2021
Comments
-
Vlad Vivdovitch almost 3 years
Is there a "canonical" way of doing that? I've been using head -n | tail -1 which does the trick, but I've been wondering if there's a Bash tool that specifically extracts a line (or a range of lines) from a file. By "canonical" I mean a program whose main function is doing that.
-
Rafael Barbosa almost 11 years
Why is the '<' necessary in this case? Wouldn't I achieve the same output without it?
-
clt60 almost 11 years
@RafaelBarbosa the < in this case is not necessary. Simply, it is my preference using redirects, because me often used redirects like sed -n '100p' < <(some_command) - so, universal syntax :). It is NOT less effective, because redirection are done with shell when forking itself, so... it is only a preference... (and yes, it is one character longer) :)
-
tripleee over 10 years
The -n option disables the default action to print every line, as surely you would have found out by a quick glance at the man page.
-
Skippy le Grand Gourou about 10 years
For those wondering, this solution seems about 6 to 9 times faster than the sed -n 'NUMp' and sed 'NUM!d' solutions proposed below.
-
rici about 10 years
I think tail -n+NUM file | head -n1 is likely to be just as fast or faster. At least, it was (significantly) faster on my system when I tried it with NUM being 250000 on a file with half a million lines. YMMV, but I don't really see why it would.
-
tripleee about 8 years
The colon is a syntax error, and should be a semicolon.
-
rasen58 over 7 years
@jm666 Actually it's 2 characters longer since you would normally put the '<' as well as an extra space ' ' after < as opposed to just one space if you hadn't used the < :)
-
clt60 over 7 years
@rasen58 the space is an character too? :) /okay, just kidding - youre right/ :)
-
agc about 7 years
In GNU sed all the sed answers are about the same speed. Therefore (for GNU sed) this is the best sed answer, since it would save time for large files and small nth line values.
-
wisbucky over 6 years
Is performance different between head | tail vs tail | head? Or does it depend on which line is being printed (beginning of file vs end of file)?
-
Philipp Claßen over 6 years
@wisbucky I have no hard figures, but one disadvantage of first using tail followed by a "head -1" is that you need to know the total length in advance. If you do not know it, you would have to count it first, which will be a loss performance-wise. Another disadvantage is that it is less intuitive to use. For instance, if you have the number 1 to 10 and you want to get the 3rd line, you would have to use "tail -8 | head -1". That is more error prone than "head -3 | tail -1".
-
wisbucky over 6 years
sorry, I should have included an example to be clear. head -5 | tail -1 vs tail -n+5 | head -1. Actually, I found another answer that did a test comparison and found tail | head to be faster. stackoverflow.com/a/48189289
-
Philipp Claßen over 6 years
@wisbucky Thank you for mentioning it! I did some tests and have to agree that it was always slightly faster, independent of the position of the line from what I saw. Given that, I changed my answer and also included the benchmark in case someone wants to reproduce it.
-
Andriy Makukha about 6 years
Can be also used to display multiple lines: cat FILE | cut -f2,5 -d$'\n' will display lines 2 and 5 of the FILE. (But it will not preserve the order.)
-
duhaime almost 6 years
This is about 5 times slower than the tail / head combination when reading a file with 50M rows
-
clt60 almost 6 years
@duhaime of course, if someone needs to do optimizations. But IMHO for the "common" problems it is ok and the difference is unnoticeable. Also, the head / tail doesn't solves the sed -n '1p;3p' scenario - aka print more non-adjacent rows...
-
duhaime almost 6 years
Amen! Just wanted to create a note for fools like me who have to do line lookups billions of times for some task...
-
clt60 almost 6 years
@duhaime of course - the note is correct and needed. :)
-
sanmai about 5 years
I wonder how long just cat'ting the entire file into /dev/null would take. (What if this was only a hard disk benchmark?)
-
Stabledog about 4 years
I feel a perverse urge to bow at your ownership of a 3+ gig text file dictionary. Whatever the rationale, this so embraces textuality :)
-
kvantour almost 4 years
This method is always going to be slower because awk attempts to do field splitting. The overhead of field splitting can be reduced by awk 'BEGIN{FS=RS}(NR == num_line) {print; exit}' file
-
kvantour almost 4 years
The real power of awk in this method comes forth when you want to concatenate line n1 of file1, n2 of file2, n3 of file3 ... awk 'FNR==n' n=10 file1 n=30 file2 n=60 file3. With GNU awk this can be sped up using awk 'FNR==n{print;nextfile}' n=10 file1 n=30 file2 n=60 file3.
-
fedorqui almost 4 years
@kvantour indeed, GNU awk's nextfile is great for such things. How come FS=RS avoids field splitting?
-
kvantour almost 4 years
FS=RS does not avoid field splitting, but it only parses the $0 ones and only assigns one field because there is no RS in $0
-
fedorqui almost 4 years
@kvantour I've been doing some tests with FS=RS and did not see difference on the timings. What about me asking a question about it so you can expand? Thanks!
-
Hashbrown over 3 years
@rici and you can easily choose how many lines past that point by changing head -n1 to head -nNUM2, you should make this it's own answer
-
NoCake over 3 years
While testing on a file with 6,000,000 lines, and retrieving arbitrary line #2,000,000, this command was almost instantaneous and much faster than the sed answers.
-
Herman Toothrot over 3 years
How do you use this with a range of lines from line n to m?
-
anubhava over 3 years
For range better use sed -n '2,5{p;5q;}' file
-
tripleee over 3 years
The overhead of running two processes with head + tail will be negligible for a single file, but starts to show when you do this on many files.
-
tripleee over 3 years
There is no need to assign $1 to another variable before using it, and you are clobbering any other global line. In Bash, use local for function variables; but here, as stated already, probably just do sed "$1d;q" "$2". (Notice also the quoting of "$2".)
-
Mark Shust at M.academy over 3 years
Correct, but it could be helpful to have self-documented code.
-
ntj over 3 years
@anubhava isn't sed -n '2,5p' file just the same?
-
anubhava over 3 years
no it is not. Without q it will process full file
-
Fiddlestiques about 3 years
This is a good solution, but if you want to assign the output of the sed command to a variable using command substitution it will not work if the returned line contains a double asterisk. In the case where sed "4q;d" file4 returns ** banana, foo=$(sed "4q;d" file4) will assign the value file1 file2 file3 file4 banana to the variable foo (file1 - file4 being the directory contents).
-
anubhava about 3 years
@Fiddlestiques: Don't forget quoting to make it foo="$(sed "4q;d" file4)"
-
Fiddlestiques about 3 years
@anubhava - thanks - got it now - echo "$foo" rather than echo $foo
-
Ulysse BN about 2 years
Made a script from this solution usage: nth line [file], if file is omitted looks at stdin : github.com/BuonOmo/dotfiles/blob/main/.zsh/custom/functions/nth
-
algae almost 2 years
Is there a simple way to extend this solution to multiple files at once? e.g. head -7 -q input*.txt | tail -1 to get the 7th line from several files input*.txt? Currently this will just obtain the 7th line from the first file listed in input*.txt.
-
Yeti almost 2 years
Note that the first line has N = 1 instead of zero.
-
jonathanking almost 2 years
I'm sorry, but I don't understand why d is necessary? Why do you need to "delete" anything at all in this case?
-
jonathanking almost 2 years
@anubhava I read the answer, but I created a comment because I did not understand the explanation. Why even include the "delete" command if the end result is that the command is inhibited by the "quit" command? Why not just have "NUMq"? Why are we deleting instead of printing for the original poster's question?
-
anubhava almost 2 years
Try running sed "${NUM}q" file and you will understand better why ;d is also needed