Count line lengths in file using command line tools

Solution 1

This counts the line lengths using awk, then sorts the (numeric) line lengths using sort -n, and finally counts the unique line length values using uniq -c.

$ awk '{print length}' input.txt | sort -n | uniq -c
      1 1
      2 2
      3 4
      1 5
      2 6
      2 7

In the output, the first column is the number of lines with the given length, and the second column is the line length.
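If the layout from the question is wanted (a header line, with the length first), the same pipeline can be wrapped in a small function that swaps the two columns. This is only a sketch: the name count_line_lengths is borrowed from the question's example, and the header text and column width are assumptions.

```shell
#!/bin/bash
# count_line_lengths: hypothetical wrapper around the pipeline shown above.
# Usage: count_line_lengths FILE
count_line_lengths() {
    echo "Length Occurrences"
    # Stage 1: print the length of every line.
    # Stage 2: sort the lengths numerically so equal values are adjacent.
    # Stage 3: collapse equal lengths into "count length" pairs.
    # Stage 4: swap the columns so the length comes first.
    awk '{ print length($0) }' "$1" \
        | sort -n \
        | uniq -c \
        | awk '{ printf "%-6s %s\n", $2, $1 }'
}
```

Calling count_line_lengths file.txt on the sample file from the question should print the header followed by the six length/count rows.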

Solution 2

Pure awk

awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt

4 3
5 1
6 2
7 2
1 1
2 2
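Note that the rows above are not in numeric order: awk does not guarantee any particular order for for (i in a). A minimal sketch of one way to get ordered output is to pipe the result through sort -n; the function name length_histogram is made up for this example.

```shell
#!/bin/bash
# length_histogram: hypothetical helper, not part of the original answer.
# Usage: length_histogram FILE
length_histogram() {
    # Tally the lengths inside awk, then order the "length count" rows
    # numerically by their leading field.
    awk '{ ++a[length($0)] } END { for (i in a) print i, a[i] }' "$1" | sort -n
}
```

This keeps the counting in awk and only uses sort for the final ordering, so sort sees one row per distinct length rather than one row per input line.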

Solution 3

Using bash arrays:

#!/bin/bash

while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt

echo "Length Occurrence"
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done

Example run:

$ ./t.sh
Length Occurrence
1      1
2      2
4      3
5      1
6      2
7      2
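One caveat with read-based loops: a plain read strips leading and trailing whitespace and consumes backslashes, so the measured length can be too small. The sketch below compares it with IFS= read -r on a single string; both function names are hypothetical and exist only for this comparison.

```shell
#!/bin/bash
# line_length_plain / line_length_raw: hypothetical names for this comparison.

# Plain read: trims leading/trailing whitespace and removes backslashes.
line_length_plain() {
    local line
    read line <<< "$1"
    echo "${#line}"
}

# IFS= read -r: keeps the line exactly as written.
line_length_raw() {
    local line
    IFS= read -r line <<< "$1"
    echo "${#line}"
}

line_length_plain "  ab  "   # prints 2
line_length_raw   "  ab  "   # prints 6
```

For the sample file this makes no difference, since none of its lines contain surrounding spaces or backslashes.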

Solution 4

$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt

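The output lists each line length followed by its count. The order of the lines is not guaranteed, because Perl does not return hash keys in sorted order.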

Solution 5

If you allow for the columns to be swapped and don't need the headers, something as easy as

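while IFS= read -r line; do printf '%s' "$line" | wc -m; done < file | sort -n | uniq -c

(without any advanced tricks with sed or awk) will work. Each output line shows the count first and the line length second. IFS= read -r keeps leading and trailing whitespace and backslashes intact, and sort -n orders the lengths numerically rather than as text.

One important thing to keep in mind: wc -c counts bytes, not characters, and will not give the correct length for strings containing multibyte characters. That is why this uses wc -m.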
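The difference between the two options can be checked with a short sketch. The helper names byte_count and char_count are made up for illustration, and the character count assumes the shell is running under a UTF-8 locale.

```shell
#!/bin/bash
# byte_count / char_count: hypothetical helpers for comparing wc -c and wc -m.
# tr removes the padding that some wc implementations add around the number.
byte_count() { printf '%s' "$1" | wc -c | tr -d '[:space:]'; }
char_count() { printf '%s' "$1" | wc -m | tr -d '[:space:]'; }

byte_count "你好"; echo   # 6 when the text is UTF-8 encoded (3 bytes per character)
char_count "你好"; echo   # 2 under a UTF-8 locale; other locales may report more
```

For plain ASCII input the two functions return the same number, which is why the problem only shows up once multibyte characters appear.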

References:

man uniq(1)

man sort(1)

man wc(1)

Pete Hamilton

Engineer @ Monzo. Previously GoCardless, Amazon & Next Jump

Updated on March 31, 2022

Comments

Pete Hamilton about 2 years
Problem

If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?

Example:

file.txt
```
this
is
a
sample
file
with
several
lines
of
varying
length
```
Running count_line_lengths file.txt would give:
```
Length Occurrences
1      1
2      2
4      3
5      1
6      2
7      2
```
Ideas?
- Bill almost 11 years
  
  how do you know length=1 is for which word? you should store the word too.
- Pete Hamilton almost 11 years
  
  Language: Preferably using a clever shell command. I could easily do this in something like Ruby or Python, but that's no fun ;)
- Pete Hamilton almost 11 years
  
  @Bill I don't really care about the word, only the line lengths, unless I misunderstood your question?
Anders Johansson almost 11 years

Or shorter: awk '{print length}' input.txt | sort | uniq -c
Adrian Frühwirth almost 11 years

@fedorqui It's not really portable though, so depending on the use case awk wins ;-) Just posted it because the OP specifically asked for something not involving another external language, which kind of also means awk (that's how I read it). On the upside, it's not even so much longer if you consider while read l;do((h[${#l}]++));done<file.txt;for l in "${!h[@]}";do echo "$l ${h[$l]}";done ...
glenn jackman almost 11 years

for golfing fun: perl -lnE '$c{+length}++}{say "$_ $c{$_}" for keys %c'
TrueY almost 11 years

Nice pipe snake, but counting and uniq could be done inside awk easily. I suppose sort also can be done in gawk. I prefer the pure bash solution.
Randall Cook about 10 years

I had a file with a pathologically long line (700-1000MB) and of all the one-liners here, only this one didn't crash. +1!
user82116 about 9 years

I did this, but we have really long lines and sort doesn't sort numbers correctly by default (I got output like 1 9575 1 999 with this). To sort numbers correctly, use sort -g, making the original: awk '{print length}' input.txt | sort -g | uniq -c
imrek over 4 years

wc -c counts the bytes, not the characters. If you have multibyte characters, you'll get larger numbers. Try echo -n "你好" | wc -c vs. echo -n "你好" | wc -m.
Maksym Ganenko over 4 years

@DrunkenMaster You must be right. Should I just replace wc -c with wc -m?
imrek over 4 years

I think it will be clear now for anyone reading your answer, it's enough to refer to the comment above.
Hashim Aziz over 3 years

@user82116 I believe replacing that sort command with LC_ALL=C sort would have the advantages of sorting characters properly too, as well as being faster.
Chris Noe almost 3 years

This does not count trailing spaces. $line needs to be quoted.