How to find line with least characters


Solution 1

A Perl way. Note that if several lines share the same shortest length, this approach prints only the first of them:

perl -lne '$m//=$_; $m=$_ if length()<length($m); END{print $m if $.}' file 

Explanation

  • perl -lne : -n means "read the input file line by line", -l causes trailing newlines to be removed from each input line and a newline to be added to each print call; and -e is the script that will be applied to each line.
  • $m//=$_ : set $m to the current line ($_) unless $m is defined. The //= operator is available since Perl 5.10.0.
  • $m=$_ if length()<length($m) : if the length of the current value of $m is greater than the length of the current line, save the current line ($_) as $m.
  • END{print $m if $.} : once all lines have been processed, print the current value of $m, the shortest line. The if $. ensures that this only happens when the line number ($.) is true, i.e. at least one line was read, which avoids printing an empty line for empty input.
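
For readers who find the one-liner dense, the same single-pass logic can be sketched in Python. This is only an illustration: the function name first_shortest is made up here, and, like perl -l, it strips the trailing newline before measuring.

```python
def first_shortest(lines):
    """Return the first line of minimal length, or None for empty input."""
    shortest = None
    for line in lines:
        line = line.rstrip("\n")  # like perl -l: drop the trailing newline
        # Strict "<" keeps the earliest line when several share the minimum.
        if shortest is None or len(line) < len(shortest):
            shortest = line
    return shortest


sample = ["seven/7\n", "4for\n", "8 eight?\n", "five!\n"]
print(first_shortest(sample))  # prints 4for
```

To use it as a filter, pass sys.stdin (or an open file) instead of the sample list, and skip the print when the result is None.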

Alternatively, since your file is small enough to fit in memory, you can do:

perl -e '@K=sort{length($a) <=> length($b)}<>; print "$K[0]"' file 

Explanation

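  • @K=sort{length($a) <=> length($b)}<> : in list context, <> returns a list whose elements are the lines of the file. The sort orders them by length and the sorted lines are saved in the array @K. Each line keeps its trailing newline, which does not change the ordering as long as the last line also ends in a newline.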
  • print "$K[0]" : print the first element of array @K: the shortest line.

If you want to print all shortest lines, you can use

perl -e '@K=sort{length($a) <=> length($b)}<>; 
         print grep {length($_)==length($K[0])}@K; ' file 

Solution 2

Here's a variant of an awk solution for printing the first found minimum line:

awk '
  NR==1 || length<len {len=length; line=$0}
  END {print line}
'

which can simply be extended by one condition to print all minimum lines:

awk '
  length==len {line=line ORS $0}
  NR==1 || length<len {len=length; line=$0}
  END {print line}
'
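
The "print all minimum lines" idea can also be sketched in Python, in a single pass: reset the list of candidates whenever a strictly shorter line appears, and append on a tie. The name all_shortest is hypothetical and chosen only for this example.

```python
def all_shortest(lines):
    """Return every line of minimal length, in input order."""
    best_len = None
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if best_len is None or len(line) < best_len:
            best_len = len(line)  # new minimum: discard earlier candidates
            kept = [line]
        elif len(line) == best_len:
            kept.append(line)     # tie: remember this line as well
    return kept


print(all_shortest(["ccc\n", "ab\n", "dddd\n", "xy\n"]))  # prints ['ab', 'xy']
```

Unlike the sort-based Perl variant, this never holds more than the current candidates in memory.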

Solution 3

With sqlite3:

sqlite3 <<EOT
CREATE TABLE file(line);
.import "data.txt" file
SELECT line FROM file ORDER BY length(line) LIMIT 1;
EOT
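
As one of the comments below reports, .import can stumble over lines that contain quote or separator characters. A possible workaround, sketched here with Python's standard sqlite3 module, is to insert the lines through bound parameters so that they are stored verbatim. The function name shortest_via_sqlite is an assumption of this sketch, and so is the rowid tie-breaker, which is added only to make the "first shortest line" choice deterministic.

```python
import sqlite3


def shortest_via_sqlite(lines):
    """Load lines into an in-memory table and let SQLite pick the shortest."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE file(line TEXT)")
        # Parameter binding stores each line verbatim, quotes and all.
        conn.executemany(
            "INSERT INTO file(line) VALUES (?)",
            ((line.rstrip("\n"),) for line in lines),
        )
        row = conn.execute(
            "SELECT line FROM file ORDER BY length(line), rowid LIMIT 1"
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


print(shortest_via_sqlite(['{"a": "b|c"}\n', '"x\n', 'longer line\n']))  # prints "x
```

The query is the same as in the shell version; only the loading step differs.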

Solution 4

Python comes out fairly concise, and the code Does What It Says On The Tin:

python -c "import sys; print min(sys.stdin, key=len),"

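This is Python 2 syntax. The final comma is obscure, I admit: it stops the print statement from adding a second linebreak after the one the line already carries. In Python 3, a version that also copes with empty input (it then prints a single empty line) looks like this: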

python3 -c "import sys; print(min(sys.stdin, key=len, default='').strip('\n'))"

Solution 5

I always love solutions with pure shell scripting (no exec!).

#!/bin/bash
min=
is_empty_input="yes"

while IFS= read -r a; do
    if [ -z "$min" -a "$is_empty_input" = "yes" ] || [ "${#a}" -lt "${#min}" ]; then
        min="$a"
    fi
    is_empty_input="no"
done

if [ -n "$a" ]; then
    if [ "$is_empty_input" = "yes" ]; then
        min="$a"
        is_empty_input="no"
    else
        [ "${#a}" -lt "${#min}" ] && min="$a"
    fi
fi

[ "$is_empty_input" = "no" ] && printf '%s\n' "$min"

Note:

There is a problem with NUL bytes in the input: bash variables cannot hold them, so read drops them. As a result, printf "ab\0\0\ncd\n" | bash this_script prints ab instead of cd.
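
For comparison, a language that can hold NUL bytes in a string does not have this limitation. The following is a minimal Python sketch, assuming the input is read as a binary stream; the name shortest_bytes_line is invented for the example, and io.BytesIO stands in for real input.

```python
import io


def shortest_bytes_line(stream):
    """Return the first shortest line from a binary stream, or None."""
    shortest = None
    for raw in stream:  # binary streams split lines on b"\n" only
        line = raw.rstrip(b"\n")
        if shortest is None or len(line) < len(shortest):
            shortest = line
    return shortest


demo = io.BytesIO(b"ab\0\0\ncd\n")
print(shortest_bytes_line(demo))  # prints b'cd'
```

With real input, sys.stdin.buffer could be passed instead of the BytesIO object. Note that this measures length in bytes, not characters.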

Author by

Matthew D. Scholefield

I'm a passionate open source developer. Whenever I find something I don't understand, I rebuild it. Aside from programming, I occasionally do 3D graphics (With Blender), and a little music composition.

Updated on September 18, 2022

Comments

  • Matthew D. Scholefield
    Matthew D. Scholefield over 1 year

    I am writing a shell script, using any general UNIX commands. I have to retrieve the line that has the least characters (whitespace included). There can be up to around 20 lines.

    I know I can use head -$L | tail -1 | wc -m to find the character count of line L. The problem is, the only method I can think of, using that, would be to manually write a mess of if statements, comparing the values.

    Example data:

    seven/7
    4for
    8 eight?
    five!
    

    Would return 4for since that line had the least characters.

    In my case, if multiple lines have the shortest length, a single one should be returned. It does not matter which one is selected, as long as it is of the minimum length. But I don't see the harm in showing both ways for other users with other situations.

    • chaos
      chaos almost 9 years
      What if there are multiple lines with a length of 4? Should they be printed too?
    • Matthew D. Scholefield
      Matthew D. Scholefield almost 9 years
      In my case, if multiple lines have the shortest length, a single one should be returned. It does not matter which one is selected, as long as it is of the minimum length. But I don't see the harm in showing both ways for other users with other situations.
  • Thushi
    Thushi almost 9 years
    +1 for the logic, but it won't work in all cases. If two lines have the same, minimum number of characters, it will give you only the first one encountered, because of head -1.
  • Thushi
    Thushi almost 9 years
    It won't work if more than one line has the same, minimum number of characters.
  • cuonglm
    cuonglm almost 9 years
    @Thushi: It will report the first minimum line.
  • Thushi
    Thushi almost 9 years
    Yeah. But that's not the correct output, right? The other lines also have the minimum number of characters.
  • cuonglm
    cuonglm almost 9 years
    @Thushi: That isn't mentioned in the OP's requirements; waiting for an update from the OP.
  • Thushi
    Thushi almost 9 years
    OK. No problem. It is just a general use case that I mentioned (something like an implicit use case/requirement). Anyhow, we will wait for him. :)
  • Toby Speight
    Toby Speight almost 9 years
    To get the longest line, it's a bit more efficient to reverse the sort than to use tail (as head can exit as soon as its job is done, without reading the rest of its input).
  • fedorqui
    fedorqui almost 9 years
    I don't think L was the best letter to choose to name the variable :D Something like min would make things clearer.
  • chaos
    chaos almost 9 years
    That one is my favorite here, never thought of SQL...
  • Matthew D. Scholefield
    Matthew D. Scholefield almost 9 years
    @Thushi Using a bit of regex, after printing line numbers, everything but the lines with the same number as line 1, could be removed, thus outputting all of the shortest lines.
  • mikeserv
    mikeserv almost 9 years
    what does the tin say?
  • Steve Jessop
    Steve Jessop almost 9 years
    @mikeserve: it says, "prints the minimum of sys.stdin, using len as the key" ;-)
  • mikeserv
    mikeserv almost 9 years
    ahh. nothing about binary size, dependency creep or execution time, then?
  • Steve Jessop
    Steve Jessop almost 9 years
    @mikeserv: no, the small print isn't on the tin. It's on an advisory leaflet in a locked filing cabinet, in a cellar, behind a door marked "beware of the leopard".
  • mikeserv
    mikeserv almost 9 years
    Gotcha - so on display.
  • cuonglm
    cuonglm almost 9 years
    You can use push @{$lines{+length}}; and print @{$lines{+min keys %lines}}; for less typing :)
  • Angel Todorov
    Angel Todorov almost 9 years
    If I was golfing, I wouldn't have used the variable name "lines" either: perl -MList::Util=min -nE'push @{$l{+length}},$_}END{say@{$l{min keys%l}}' sample
  • shadowtalker
    shadowtalker almost 9 years
    This is code golf status clever
  • Peter.O
    Peter.O almost 9 years
    +1 for a non-golfed version (which works!), though only for the print-all variant. Perl gets a bit gnarly for those of us who aren't up to par with Perl's cryptic nature. BTW, the golfed say prints a spurious blank line at the end of the output.
  • Digital Trauma
    Digital Trauma almost 9 years
    (( ${#a} < ${#min} )) is possibly cleaner than [ "${#a}" -lt "${#min}" ]. It's unusual, but in this case the double quotes around the string length expansions are not necessary: the string length will always be a contiguous string of digits.
  • mikeserv
    mikeserv almost 9 years
    Have you tried benching your no exec! solution versus others which do? Here's a comparison of the performance differences between exec! and no exec! solutions for a similar problem. execing a separate process is very seldom advantageous when it spiders - in forms like var=$(get data) because it restricts the data flow to a single context - but when you move data through a pipeline - in a stream - each applied exec is generally helpful - because it enables specialized application of modular programs only where necessary.
  • Digital Trauma
    Digital Trauma almost 9 years
    @mikeserv Yes I hadn't considered possible effects of $IFS
  • Digital Trauma
    Digital Trauma almost 9 years
    @mikeserv Yes I think expr is nicer here. Yes, e will spawn a shell for each line. I edited the sed expression so that it replaces each char in the string with a : before the eval which I think should remove any possibility of code injection.
  • Digital Trauma
    Digital Trauma almost 9 years
    Do you even need to insert line numbers? My reading of the OP is that just the shortest line is required, and not necessarily the line number of that line. I guess no harm in showing it for completeness.
  • mikeserv
    mikeserv almost 9 years
    I would usually opt for xargs expr personally - but, other than avoiding an intermediate shell, that's probably more a stylistic thing. I like it, anyway.
  • mikeserv
    mikeserv almost 9 years
    @DigitalTrauma - nah, probably not. But it is hardly very useful without them - and they come so cheaply. When working a stream i always prefer to include a means of reproducing the original input identically in the output - the line-numbers make that possible here. For example, to turn the results of the first pipeline around: REINPUT | sort -t: -nk1,1 | cut -d: -f3-. And the second is a simple matter of including another sed --expression script at the tail.
  • mikeserv
    mikeserv almost 9 years
    @DigitalTrauma - oh, and in the first example the line numbers do affect sort's behavior as a tie-breaker when same-length lines occur in input - so the earliest occurring line always floats to the top in that case.
  • yaegashi
    yaegashi almost 9 years
    Thank you all for the comments and upvotes (some of the rep should go to @cuonglm for correcting my answer). Generally I don't recommend others to daily practice pure shell scripting but that skill can be found very useful in some extreme conditions where nothing other than static linked /bin/sh is available. It's happened to me several times with SunOS4 hosts with /usr lost or some .so damaged, and now in modern Linux age I still occasionally encounter similar situations with embedded systems or initrd of boot failing systems. BusyBox is one of the great things we recently acquired.
  • John Kugelman
    John Kugelman almost 9 years
    Will this read the entire file into memory and/or create a second on-disk copy? If so, it's clever but inefficient.
  • FloHimself
    FloHimself almost 9 years
    @JohnKugelman This will probably soak up the whole 4 lines into a temporary memory only database (that is what strace indicates). If you need to work with really large files (and your system isn't swapping), you can force it by just appending a filename like sqlite3 $(mktemp) and all data will be written to disk.
  • Digital Trauma
    Digital Trauma almost 9 years
    @mikeserv From man sed on OS X: "The escape sequence \n matches a newline character embedded in the pattern space". So I think GNU sed allows \n in the regex and in the replacement, whereas BSD only allows \n in the regex and not in the replacement.
  • Digital Trauma
    Digital Trauma almost 9 years
    Borrowing the \n from the pattern space is a good idea and would work in the second s/// expression, but the s/.*/&\n&/ expression is inserting a \n into the pattern space where there wasn't one before. Also BSD sed appears to require literal newlines after label definitions and branches.
  • Digital Trauma
    Digital Trauma almost 9 years
    @mikeserv Nice. Yes, I inserted the newline I needed by doing the G first and changing the s/// expression. Splitting it up using -e allows it all to go on one (long) line with no literal newlines.
  • mikeserv
    mikeserv almost 9 years
    The \n escape is spec'd for sed's LHS, too, and i think that is the spec's statement verbatim, except that POSIX bracket expressions are also spec'd in such a way that all characters lose their special meaning - (explicitly including \\) - within one excepting the brackets, the dash as a range separator, and dot, equals, caret, colon for collation, equivalence, negation, and classes.
  • mikeserv
    mikeserv almost 9 years
    One handy thing about newline-delimited params is that they can be basically anything, and you don't even have to know what it is, so long as it is unique. It makes for some interesting options when doing... sed ... | sed -f - ... because you can define arbitrary branches labeled for the first sed's params programmatically without having to worry overmuch about syntax chars and so on. It also works for read and write files.
  • Evgeny Vereshchagin
    Evgeny Vereshchagin over 8 years
    fails with Traceback ... ValueError: min() arg is an empty sequence on empty input. My rejected fix is here
  • Stéphane Chazelas
    Stéphane Chazelas over 8 years
    Add -C to measure the length in terms of number of characters instead of number of bytes. In a UTF-8 locale, $$ has fewer bytes than (2 vs 3), but more characters (2 vs 1).
  • Ahmedov
    Ahmedov almost 8 years
    I get the following errors: """xaa:8146: unescaped " character """ and """xaa:8825: expected 1 columns but found 2 - extras ignored""" .The file consists of json documents 1 per each line.
  • agc
    agc almost 5 years
    It'd be nice to not need an $f variable; I've a notion that might be possible using tee somehow...
  • filbranden
    filbranden over 4 years
    For Python3, the print() function takes an end= named argument for the line end. So this is better, and equivalent to the Python2 trailing comma: print(min(sys.stdin, key=len, default=''), end='')
  • Marcello de Sales
    Marcello de Sales over 2 years
    Absolutely amazing! Found the root API of Swagger API output :) curl -s http://localhost:8750/swagger/docs/v2 | jq -r '.paths | keys[]' | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- | head -1