How to find line with least characters
Solution 1
A Perl way. Note that if there are many lines of the same, shortest length, this approach will only print one of them:
perl -lne '$m//=$_; $m=$_ if length()<length($m); END{print $m if $.}' file
Explanation
perl -lne: -n means "read the input file line by line"; -l causes trailing newlines to be removed from each input line and a newline to be added to each print call; and -e introduces the script that will be applied to each line.
$m//=$_: set $m to the current line ($_) unless $m is already defined. The //= operator is available since Perl 5.10.0.
$m=$_ if length()<length($m): if the length of the current value of $m is greater than the length of the current line, save the current line ($_) as $m.
END{print $m if $.}: once all lines have been processed, print the current value of $m, the shortest line. The if $. ensures that this only happens when at least one line was read (the line number, $., is then non-zero), avoiding printing an empty line for empty input.
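For readers who find the Perl terse, the same single-pass idea can be sketched in Python. This is only an illustration of the logic explained above, not a replacement for the one-liner; the function name shortest_line and the sample list (taken from the question's example data) are made up for this sketch.

```python
def shortest_line(lines):
    """Return the first shortest line, or None if there are no lines."""
    shortest = None
    for line in lines:
        line = line.rstrip("\n")  # drop the trailing newline, like perl -l
        # Strict "<" keeps the earliest line when several share the minimum.
        if shortest is None or len(line) < len(shortest):
            shortest = line
    return shortest


# Sample data from the question (hypothetical in-memory stand-in for a file).
sample = ["seven/7\n", "4for\n", "8 eight?\n", "five!\n"]
print(shortest_line(sample))  # prints 4for
```

Returning None for empty input plays the role of the if $. guard: the caller can decide to print nothing.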
Alternatively, since your file is small enough to fit in memory, you can do:
perl -e '@K=sort{length($a) <=> length($b)}<>; print "$K[0]"' file
Explanation
@K=sort{length($a) <=> length($b)}<>: <> is evaluated in list context here, so it returns all the lines of the file. The sort orders them by length, and the sorted lines are saved in the array @K.
print "$K[0]": print the first element of array @K, the shortest line.
If you want to print all shortest lines, you can use
perl -e '@K=sort{length($a) <=> length($b)}<>;
print grep {length($_)==length($K[0])}@K; ' file
Solution 2
Here's a variant of an awk solution for printing the first found minimum line:
awk '
NR==1 || length<len {len=length; line=$0}
END {print line}
'
which can simply be extended by one condition to print all minimum lines:
awk '
length==len {line=line ORS $0}
NR==1 || length<len {len=length; line=$0}
END {print line}
'
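The "collect every minimum line in one pass" logic of the second awk program can also be written out in Python. This is a rough sketch under the assumption that ties should be kept in input order; the name all_shortest_lines is hypothetical.

```python
def all_shortest_lines(lines):
    """Return every line of minimum length, in input order."""
    shortest = []
    min_len = None
    for line in lines:
        line = line.rstrip("\n")
        if min_len is None or len(line) < min_len:
            # New minimum: forget the lines collected so far.
            min_len = len(line)
            shortest = [line]
        elif len(line) == min_len:
            # Tie with the current minimum: keep this line as well.
            shortest.append(line)
    return shortest


print(all_shortest_lines(["abc\n", "de\n", "fg\n", "hijk\n"]))  # prints ['de', 'fg']
```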
Solution 3
With sqlite3:
sqlite3 <<EOT
CREATE TABLE file(line);
.import "data.txt" file
SELECT line FROM file ORDER BY length(line) LIMIT 1;
EOT
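If the input contains quotes or separator characters, .import may complain about unescaped characters or extra columns (see the comments). One possible workaround, sketched here with Python's standard sqlite3 module rather than the sqlite3 shell, is to insert each line through a bound parameter. The function name shortest_line_sql is made up for this sketch, and the rowid tie-breaker is my own addition, not part of the original query.

```python
import sqlite3


def shortest_line_sql(lines):
    """Load lines into an in-memory table and let SQLite pick the shortest."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE file(line TEXT)")
        # Bound parameters store each line verbatim, quotes and all.
        conn.executemany(
            "INSERT INTO file(line) VALUES (?)",
            ((line.rstrip("\n"),) for line in lines),
        )
        # rowid as a secondary key (an assumption of this sketch) makes the
        # earliest line win when several share the minimum length.
        row = conn.execute(
            "SELECT line FROM file ORDER BY length(line), rowid LIMIT 1"
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None


print(shortest_line_sql(["seven/7\n", "4for\n", "8 eight?\n", "five!\n"]))  # prints 4for
```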
Solution 4
Python comes out fairly concise, and the code Does What It Says On The Tin:
python -c "import sys; print min(sys.stdin, key=len),"
The final comma is obscure, I admit: in Python 2 it prevents the print statement from adding an additional line break. In Python 3, you can write this so that it also handles empty input (0 lines):
python3 -c "import sys; print(min(sys.stdin, key=len, default='').strip('\n'))"
Solution 5
I always love solutions with pure shell scripting (no exec!).
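#!/bin/bash
min=
is_empty_input="yes"
# Read line by line; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r a; do
    if [ -z "$min" -a "$is_empty_input" = "yes" ] || [ "${#a}" -lt "${#min}" ]; then
        min="$a"
    fi
    is_empty_input="no"
done
# Handle a final line that has no trailing newline: read returns
# non-zero for it but still leaves its text in $a.
if [ -n "$a" ]; then
    if [ "$is_empty_input" = "yes" ]; then
        min="$a"
        is_empty_input="no"
    else
        [ "${#a}" -lt "${#min}" ] && min="$a"
    fi
fi
[ "$is_empty_input" = "no" ] && printf '%s\n' "$min"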
Note:
There is a problem with NUL bytes in the input: printf "ab\0\0\ncd\n" | bash this_script prints ab instead of cd.
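For comparison only (this is not pure shell), a language whose strings can hold NUL bytes does not have this limitation. A minimal Python sketch that works on raw bytes follows; the function name shortest_line_bytes is hypothetical, and in real use the data could come from sys.stdin.buffer.read().

```python
def shortest_line_bytes(data):
    """Return the first shortest line of a bytes object, or None if it is empty."""
    if not data:
        return None
    lines = data.split(b"\n")
    if lines[-1] == b"":
        # A trailing newline terminates the last line; it is not an extra empty line.
        lines.pop()
    # min() returns the first of several equally short lines.
    return min(lines, key=len)


# Same input as the printf example above: the NUL bytes count toward the length.
print(shortest_line_bytes(b"ab\0\0\ncd\n"))  # prints b'cd'
```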
Matthew D. Scholefield
Updated on September 18, 2022
Comments
-
Matthew D. Scholefield over 1 year
I am writing a shell script, using any general UNIX commands. I have to retrieve the line that has the least characters (whitespace included). There can be up to around 20 lines.
I know I can use
head -$L | tail -1 | wc -m
to find the character count of line L. The problem is, the only method I can think of, using that, would be to manually write a mess of if statements, comparing the values.
Example data:
seven/7
4for
8 eight?
five!
Would return
4for
since that line had the least characters.
In my case, if multiple lines have the shortest length, a single one should be returned. It does not matter which one is selected, as long as it is of the minimum length. But I don't see the harm in showing both ways for other users with other situations.
-
chaos almost 9 years
What if there are multiple lines with a length of 4? Should they be printed too?
-
Matthew D. Scholefield almost 9 years
In my case, if multiple lines have the shortest length, a single one should be returned. It does not matter which one is selected, as long as it is of the minimum length. But I don't see the harm in showing both ways for other users with other situations.
-
-
Thushi almost 9 years
+1 for the logic, but it won't work in all cases. If two lines have the same, minimum number of characters, it will give you only the first one encountered, because of head -1
-
Thushi almost 9 years
It won't work if more than one line has the same number of characters and that number is also the minimum.
-
cuonglm almost 9 years
@Thushi: It will report the first minimum line.
-
Thushi almost 9 years
Yeah. But that's not the correct output, right? The other lines also have the minimum number of characters.
-
cuonglm almost 9 years
@Thushi: That isn't mentioned in the OP's requirements; waiting for an update from the OP.
-
Thushi almost 9 years
OK. No problem. It is just a general use case that I mentioned (something like implicit use cases/requirements). Anyhow, we will wait for him. :)
-
Toby Speight almost 9 years
To get the longest line, it's a bit more efficient to reverse the sort than to use tail (as head can exit as soon as its job is done, without reading the rest of its input).
-
fedorqui almost 9 years
I don't think L was the best letter to choose to name the variable :D Something like min would make things clearer
-
chaos almost 9 years
That one is my favorite here, never thought of SQL...
-
Matthew D. Scholefield almost 9 years
@Thushi Using a bit of regex, after printing line numbers, everything but the lines with the same number as line 1, could be removed, thus outputting all of the shortest lines.
-
mikeserv almost 9 years
what does the tin say?
-
Steve Jessop almost 9 years
@mikeserv: it says, "prints the minimum of sys.stdin, using len as the key" ;-)
-
mikeserv almost 9 years
ahh. nothing about binary size, dependency creep or execution time, then?
-
Steve Jessop almost 9 years
@mikeserv: no, the small print isn't on the tin. It's on an advisory leaflet in a locked filing cabinet, in a cellar, behind a door marked "beware of the leopard".
-
mikeserv almost 9 years
Gotcha - so on display.
-
cuonglm almost 9 years
You can use push @{$lines{+length}}; and print @{$lines{+min keys %lines}}; for less typing :)
-
Angel Todorov almost 9 years
If I was golfing, I wouldn't have used the variable name "lines" either:
perl -MList::Util=min -nE'push @{$l{+length}},$_}END{say@{$l{min keys%l}}' sample
-
shadowtalker almost 9 years
This is code golf status clever
-
Peter.O almost 9 years
+1 for a non-golfed version (which works!), though for only the print all variant. – perl gets a bit gnarly for those of us who aren't up to par with perl's cryptic nature. BTW, the golfed say prints a spurious blank line at the end of the output.
-
Digital Trauma almost 9 years
(( ${#a} < ${#min} )) is possibly cleaner than [ "${#a}" -lt "${#min}" ]. It's unusual, but in this case the double quotes around the string length expansions are not necessary - string length will always be a contiguous string of digits.
-
Digital Trauma almost 9 years
-
mikeserv almost 9 years
Have you tried benching your no exec! solution versus others which do? Here's a comparison of the performance differences between exec! and no exec! solutions for a similar problem. execing a separate process is very seldom advantageous when it spiders - in forms like var=$(get data) because it restricts the data flow to a single context - but when you move data through a pipeline - in a stream - each applied exec is generally helpful - because it enables specialized application of modular programs only where necessary.
-
Digital Trauma almost 9 years
@mikeserv Yes I hadn't considered possible effects of $IFS
-
Digital Trauma almost 9 years
@mikeserv Yes I think expr is nicer here. Yes, e will spawn a shell for each line. I edited the sed expression so that it replaces each char in the string with a : before the eval which I think should remove any possibility of code injection.
-
Digital Trauma almost 9 years
Do you even need to insert line numbers? My reading of the OP is that just the shortest line is required, and not necessarily the line number of that line. I guess no harm in showing it for completeness.
-
Stéphane Chazelas almost 9 years
-
mikeserv almost 9 years
I would usually opt for xargs expr personally - but, other than avoiding an intermediate shell, that's probably more a stylistic thing. I like it, anyway.
-
mikeserv almost 9 years
@DigitalTrauma - nah, probably not. But it is hardly very useful without them - and they come so cheaply. When working a stream i always prefer to include a means of reproducing the original input identically in the output - the line-numbers make that possible here. For example, to turn the results of the first pipeline around: REINPUT | sort -t: -nk1,1 | cut -d: -f3-. And the second is a simple matter of including another sed --expression script at the tail.
-
mikeserv almost 9 years
@DigitalTrauma - oh, and in the first example the line numbers do affect sort's behavior as a tie-breaker when same-length lines occur in input - so the earliest occurring line always floats to the top in that case.
-
yaegashi almost 9 years
Thank you all for the comments and upvotes (some of the rep should go to @cuonglm for correcting my answer). Generally I don't recommend others to daily practice pure shell scripting but that skill can be found very useful in some extreme conditions where nothing other than static linked /bin/sh is available. It's happened to me several times with SunOS4 hosts with /usr lost or some .so damaged, and now in modern Linux age I still occasionally encounter similar situations with embedded systems or initrd of boot failing systems. BusyBox is one of the great things we recently acquired.
-
John Kugelman almost 9 years
Will this read the entire file into memory and/or create a second on-disk copy? If so, it's clever but inefficient.
-
FloHimself almost 9 years
@JohnKugelman This will probably soak up the whole 4 lines into a temporary memory only database (that is what strace indicates). If you need to work with really large files (and your system isn't swapping), you can force it by just appending a filename like sqlite3 $(mktemp) and all data will be written to disk.
-
Digital Trauma almost 9 years
@mikeserv From man sed on OS X: "The escape sequence \n matches a newline character embedded in the pattern space". So I think GNU sed allows \n in the regex and in the replacement, whereas BSD only allows \n in the regex and not in the replacement.
-
Digital Trauma almost 9 years
Borrowing the \n from the pattern space is a good idea and would work in the second s/// expression, but the s/.*/&\n&/ expression is inserting a \n into the pattern space where there wasn't one before. Also BSD sed appears to require literal newlines after label definitions and branches.
-
Digital Trauma almost 9 years
@mikeserv Nice. Yes, I inserted the newline I needed by doing the G first and changing the s/// expression. Splitting it up using -e allows it all to go on one (long) line with no literal newlines.
-
mikeserv almost 9 years
The \n escape is spec'd for sed's LHS, too, and i think that is the spec's statement verbatim, except that POSIX bracket expressions are also spec'd in such a way that all characters lose their special meaning - (explicitly including \\) - within one excepting the brackets, the dash as a range separator, and dot, equals, caret, colon for collation, equivalence, negation, and classes.
-
mikeserv almost 9 years
One handy thing about newline delimmed params is they can be basically anything - and you don't even have to know what it is, so long as it is unique. It makes for some interesting options when doing... sed ... | sed -f - ... because you can define arbitrary branches labeled for the first sed's params programmatically without having to worry overmuch about syntax chars and so on. It also works for read (r) and write (w) files.
-
Evgeny Vereshchagin over 8 years
fails with Traceback ... ValueError: min() arg is an empty sequence on empty input. My rejected fix is here
-
Stéphane Chazelas over 8 years
Add -C to measure the length in terms of number of characters instead of number of bytes. In a UTF-8 locale, $$ has fewer bytes than € (2 vs 3), but more characters (2 vs 1).
-
Ahmedov almost 8 years
I get the following errors: """xaa:8146: unescaped " character """ and """xaa:8825: expected 1 columns but found 2 - extras ignored""". The file consists of JSON documents, one per line.
-
agc almost 5 years
It'd be nice to not need an $f variable; I've a notion that might be possible using tee somehow...
-
filbranden over 4 years
For Python3, the print() function takes an end= named argument for the line end. So this is better, and equivalent to the Python2 trailing comma: print(min(sys.stdin, key=len, default=''), end='')
-
Marcello de Sales over 2 years
Absolutely amazing! Found the root API of Swagger API output :)
curl -s http://localhost:8750/swagger/docs/v2 | jq -r '.paths | keys[]' | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- | head -1
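
The awk | sort -n | cut | head -1 part of that pipeline is a decorate-sort-undecorate pattern: prefix each line with its length, sort on the prefix, strip it again, and keep the first line. A rough Python sketch of the same steps follows; the function name shortest_via_sort and the sample paths are invented for illustration. Python's sort is stable, so ties keep their input order, whereas sort -n may order equal-length lines differently.

```python
def shortest_via_sort(lines):
    """Decorate each line with its length, sort, and return the first line."""
    stripped = [line.rstrip("\n") for line in lines]
    # Decorate: like awk '{ print length, $0 }'.
    decorated = [(len(line), line) for line in stripped]
    # Sort on the length only: like sort -n, but stable for equal lengths.
    decorated.sort(key=lambda pair: pair[0])
    # Undecorate and take the first entry: like cut -d" " -f2- | head -1.
    return decorated[0][1] if decorated else None


# Hypothetical API paths standing in for the jq output.
print(shortest_via_sort(["/pets/{id}\n", "/pets\n", "/store\n"]))  # prints /pets
```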