Count line lengths in file using command line tools
Solution 1
This
- counts the line lengths using awk, then
- sorts the (numeric) line lengths using sort -n, and finally
- counts the unique line length values using uniq -c.
$ awk '{print length}' input.txt | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
In the output, the first column is the number of lines with the given length, and the second column is the line length.
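If the layout from the question is wanted (length first, with a header row), the same pipeline can be wrapped in a small function. This is only a sketch: the function name line_length_histogram and the temporary sample file are made up for illustration, and the header text is taken from the question.

```shell
# Hypothetical helper: prints "length count" pairs with a header row.
line_length_histogram() {
    printf 'Length Occurrences\n'
    # uniq -c prints "count length"; the last awk swaps the two columns.
    awk '{ print length($0) }' "$1" | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Recreate the sample file from the question (one word per line) and run it.
tmp=$(mktemp)
printf '%s\n' this is a sample file with several lines of varying length > "$tmp"
line_length_histogram "$tmp"
rm -f "$tmp"
```

The sort -n step matters: uniq -c only merges adjacent duplicates, and a plain sort would order the lengths as text rather than as numbers.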
Solution 2
Pure awk
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt
4 3
5 1
6 2
7 2
1 1
2 2
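The order in which awk's for (i in a) loop visits the array is unspecified, which is why the output above is not sorted by length. A minimal sketch, assuming sorted output is wanted, pipes the result through sort -n; the function name sorted_histogram and the temporary file are hypothetical.

```shell
# Hypothetical helper: counts line lengths in awk, then sorts by length.
sorted_histogram() {
    awk '{ ++count[length($0)] }
         END { for (len in count) print len, count[len] }' "$1" | sort -n
}

# Recreate the sample file from the question (one word per line) and run it.
tmp=$(mktemp)
printf '%s\n' this is a sample file with several lines of varying length > "$tmp"
sorted_histogram "$tmp"
rm -f "$tmp"
```

Only the few distinct lengths go through sort here, not every line of the input, so the extra step is cheap even on large files.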
Solution 3
Using bash arrays:
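#!/bin/bash
# IFS= and -r keep leading/trailing whitespace and backslashes intact,
# so each line is counted at its full length.
while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt
echo "Length Occurrence"
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done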
Example run:
$ ./t.sh
Length Occurrence
1 1
2 2
4 3
5 1
6 2
7 2
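One caveat with read-based loops, also raised in the comments: a plain read line strips leading and trailing whitespace and treats backslashes as escape characters, so such lines are counted as shorter than they really are. The sketch below compares both forms on a single string; the helper names count_plain and count_exact are made up for illustration.

```shell
# Hypothetical helpers: print the length that each read variant sees.
count_plain() { printf '%s\n' "$1" | { read line; printf '%s\n' "${#line}"; }; }
count_exact() { printf '%s\n' "$1" | { IFS= read -r line; printf '%s\n' "${#line}"; }; }

count_plain '  ab  '   # prints 2: surrounding spaces are stripped
count_exact '  ab  '   # prints 6: the line is kept as-is
```

Using IFS= read -r in the loop is therefore the safer choice whenever whitespace or backslashes may appear in the input.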
Solution 4
$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt
Output
6 2
1 1
4 3
7 2
2 2
5 1
Solution 5
If you allow for the columns to be swapped and don't need the headers, something as easy as
while IFS= read -r line; do printf '%s' "$line" | wc -m; done < file | sort -n | uniq -c
(without any advanced tricks with sed or awk) will work. The output is:
1 1
2 2
3 4
1 5
2 6
2 7
One important thing to keep in mind: wc -c counts the bytes, not the characters, and will not give the correct length for strings containing multibyte characters. Therefore the use of wc -m.
Pete Hamilton
Engineer @ Monzo. Previously GoCardless, Amazon & Next Jump
Updated on March 31, 2022
Comments
-
Pete Hamilton about 2 years
Problem
If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?
Example:
file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Running
count_line_lengths file.txt
would give:
Length Occurrences
1 1
2 2
4 3
5 1
6 2
7 2
Ideas?
-
Bill almost 11 years
how do you know length=1 is for which word? you should store the word too.
-
Pete Hamilton almost 11 years
Language: Preferably using a clever shell command. I could easily do this in something like Ruby or Python, but that's no fun ;)
-
Pete Hamilton almost 11 years
@Bill I don't really care about the word, only the line lengths, unless I misunderstood your question?
-
Anders Johansson almost 11 years
Or shorter:
awk '{print length}' input.txt | sort | uniq -c
-
Adrian Frühwirth almost 11 years
@fedorqui It's not really portable though, so depending on the use case awk wins ;-) Just posted it because the OP specifically asked for something not involving another external language, which kind of also means awk (that's how I read it). On the upside, it's not even so much longer if you consider
while read l;do((h[${#l}]++));done<file.txt;for l in "${!h[@]}";do echo "$l ${h[$l]}";done
...
-
glenn jackman almost 11 years
for golfing fun:
perl -lnE '$c{+length}++}{say "$_ $c{$_}" for keys %c'
-
TrueY almost 11 years
Nice pipe snake, but counting and uniq could be done inside awk easily. I suppose sort also can be done in gawk. I prefer the pure bash solution.
-
Randall Cook about 10 years
I had a file with a pathologically long line (700-1000MB) and of all the one-liners here, only this one didn't crash. +1!
-
user82116 about 9 years
I did this but we have really long lines and sort doesn't sort numbers correctly by default (I got output like
1 9575
1 999
with this). To correctly sort numbers use sort -g, making the original
awk '{print length}' input.txt | sort -g | uniq -c
-
imrek over 4 years
wc -c counts the bytes, not the characters. If you have multibyte characters, you'll get larger numbers. Try
echo -n "你好" | wc -c
vs.
echo -n "你好" | wc -m
-
Maksym Ganenko over 4 years
@DrunkenMaster You must be right, should I just replace wc -c with wc -m?
-
imrek over 4 years
I think it will be clear now for anyone reading your answer, it's enough to refer to the comment above.
-
Hashim Aziz over 3 years
@user82116 I believe replacing that sort command with LC_ALL=C sort would have the advantages of sorting characters properly too, as well as being faster.
-
Chris Noe almost 3 years
This does not count trailing spaces. $line needs to be quoted.