Count line lengths in file using command line tools

Solution 1

This pipeline

  • prints the length of each line using awk, then
  • sorts the line lengths numerically using sort -n, and finally
  • counts how many times each distinct length occurs using uniq -c.

$ awk '{print length}' input.txt | sort -n | uniq -c
      1 1
      2 2
      3 4
      1 5
      2 6
      2 7

In the output, the first column is the number of lines with the given length, and the second column is the line length.
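
To illustrate why the pipeline uses sort -n rather than plain sort, here is a small sketch. The four length values are made up for the demonstration. Plain sort compares them character by character, which matters as soon as some lines are 10 or more characters long.

```shell
# Hypothetical line lengths, including multi-digit values.
lengths='9575
999
10
9'

lexical=$(printf '%s\n' "$lengths" | sort)      # compares character by character
numeric=$(printf '%s\n' "$lengths" | sort -n)   # compares numeric values

# Print each result on a single line for easy comparison.
printf 'sort:    %s\n' "$(printf '%s' "$lexical" | tr '\n' ' ')"
printf 'sort -n: %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"
# sort:    10 9 9575 999
# sort -n: 9 10 999 9575
```

uniq -c counts correctly in both cases, because identical values still end up next to each other; only the order of the rows differs.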

Solution 2

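Pure awk (the order in which for (i in a) visits the lengths is unspecified, so pipe the output through sort -n if you need it sorted):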

awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt

4 3
5 1
6 2
7 2
1 1
2 2
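
If sorted output with a header is wanted, as in the question, the awk one-liner can be combined with sort -n. The following is a minimal sketch, assuming the sample words from the question as input; the temporary file and the histogram variable are names chosen for this illustration only.

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' this is a sample file with several lines of varying length > "$tmp"

# Count in awk, sort the unordered "length count" pairs numerically,
# then prepend the header and align the columns.
histogram=$(
  awk '{ ++a[length($0)] } END { for (i in a) print i, a[i] }' "$tmp" |
    sort -n |
    awk 'BEGIN { print "Length Occurrences" } { printf "%-6s %s\n", $1, $2 }'
)
printf '%s\n' "$histogram"
rm -f "$tmp"
# Length Occurrences
# 1      1
# 2      2
# 4      3
# 5      1
# 6      2
# 7      2
```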

Solution 3

Using bash arrays:

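#!/bin/bash

# IFS= and -r keep leading/trailing whitespace and backslashes intact,
# so ${#line} is the true length of each line.
while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt

echo "Length Occurrence"
# Indices of an indexed array expand in ascending numeric order.
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done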

Example run:

$ ./t.sh
Length Occurrence
1      1
2      2
4      3
5      1
6      2
7      2

Solution 4

$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt

Output

6 2
1 1
4 3
7 2
2 2
5 1

Solution 5

If you allow for the columns to be swapped and don't need the headers, something as easy as

while IFS= read -r line; do printf '%s' "$line" | wc -m; done < file | sort -n | uniq -c

(without any advanced tricks with sed or awk) will work. The output is:

1 1
2 2
3 4
1 5
2 6
2 7

One important thing to keep in mind: wc -c counts bytes, not characters, so it gives the wrong length for strings containing multibyte characters. That is why this solution uses wc -m.
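
The difference between the two options can be checked with a short sketch. It assumes a UTF-8 locale; in the C locale wc -m typically falls back to counting bytes. The sample string and the variable names are chosen for this illustration only.

```shell
# "héllo" written with octal escapes so the bytes are exact:
# é is the two-byte UTF-8 sequence \303\251.
text=$(printf 'h\303\251llo')

# tr strips the padding that some wc implementations add.
bytes=$(printf '%s' "$text" | wc -c | tr -d ' ')
chars=$(printf '%s' "$text" | wc -m | tr -d ' ')

# bytes is always 6; chars is 5 when the locale is UTF-8.
echo "bytes=$bytes chars=$chars"
```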

References:

man uniq(1)

man sort(1)

man wc(1)

Author: Pete Hamilton

Engineer @ Monzo. Previously GoCardless, Amazon & Next Jump

Updated on March 31, 2022

Comments

  • Pete Hamilton about 2 years

    Problem

    If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?

    Example:

    file.txt

    this
    is
    a
    sample
    file
    with
    several
    lines
    of
    varying
    length
    

    Running count_line_lengths file.txt would give:

    Length Occurrences
    1      1
    2      2
    4      3
    5      1
    6      2
    7      2
    

    Ideas?

    • Bill almost 11 years
      How do you know which word length=1 is for? You should store the word too.
    • Pete Hamilton almost 11 years
      Language: Preferably using a clever shell command. I could easily do this in something like Ruby or Python, but that's no fun ;)
    • Pete Hamilton almost 11 years
      @Bill I don't really care about the word, only the line lengths, unless I misunderstood your question?
  • Anders Johansson almost 11 years
    Or shorter: awk '{print length}' input.txt | sort | uniq -c
  • Adrian Frühwirth almost 11 years
    @fedorqui It's not really portable though, so depending on the use case awk wins ;-) Just posted it because the OP specifically asked for something not involving another external language, which kind of also means awk (that's how I read it). On the upside, it's not even so much longer if you consider while read l;do((h[${#l}]++));done<file.txt;for l in "${!h[@]}";do echo "$l ${h[$l]}";done ...
  • glenn jackman almost 11 years
    for golfing fun: perl -lnE '$c{+length}++}{say "$_ $c{$_}" for keys %c'
  • TrueY almost 11 years
    Nice pipe snake, but counting and uniq could be done inside awk easily. I suppose sort also can be done in gawk. I prefer the pure bash solution.
  • Randall Cook about 10 years
    I had a file with a pathologically long line (700-1000MB) and of all the one-liners here, only this one didn't crash. +1!
  • user82116 about 9 years
    I did this, but we have really long lines and sort doesn't sort numbers correctly by default (I got output like 1 9575 1 999 with this). To sort numbers correctly, use sort -g, making the original awk '{print length}' input.txt | sort -g | uniq -c
  • imrek over 4 years
    wc -c counts the bytes, not the characters. If you have multibyte characters, you'll get larger numbers. Try echo -n "你好" | wc -c vs. echo -n "你好" | wc -m.
  • Maksym Ganenko over 4 years
    @DrunkenMaster You must be right. Should I just replace wc -c with wc -m?
  • imrek over 4 years
    I think it will be clear now for anyone reading your answer, it's enough to refer to the comment above.
  • Hashim Aziz over 3 years
    @user82116 I believe replacing that sort command with LC_ALL=C sort would have the advantages of sorting characters properly too, as well as being faster.
  • Chris Noe almost 3 years
    This does not count trailing spaces. $line needs to be quoted.
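
The trailing-space issue raised in the last comment can be reproduced with a short sketch. With the default IFS, read strips leading and trailing whitespace before the value reaches the variable, so quoting alone does not bring it back; clearing IFS for the read call (and adding -r so backslashes are kept) preserves the line as-is. The sample line and the variable names are made up for this illustration.

```shell
# A line of two letters followed by two trailing spaces.
stripped=$(printf 'ab  \n' | { read line; printf '%s' "${#line}"; })
kept=$(printf 'ab  \n' | { IFS= read -r line; printf '%s' "${#line}"; })

echo "read line:         $stripped"   # 2, trailing spaces were dropped
echo "IFS= read -r line: $kept"       # 4, trailing spaces were counted
```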