Find rare characters with Python
Solution 1
from collections import Counter
c = Counter("text")
print(c.most_common())
Output:
[('t', 2), ('e', 1), ('x', 1)]
Solution 2
d = {}
# filename is the path to the text file
with open(filename, "r") as f:
    for c in f.read():
        if c in d:
            d[c] += 1
        else:
            d[c] = 1
print(d)
Then you can search d for the characters with the smallest counts.
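One way to do that search, as a minimal sketch: find the lowest count in d, then collect every character that has it. The sample string stands in for a file's contents, and the names lowest and rarest are only illustrative.

```python
# Build the same kind of dictionary as above, from a sample string
# (assumed here in place of a file's contents).
d = {}
for c in "hello world":
    d[c] = d.get(c, 0) + 1

# The lowest count, and every character that occurs exactly that often.
lowest = min(d.values())
rarest = sorted(ch for ch, n in d.items() if n == lowest)
print(lowest, rarest)
```

Collecting all characters tied at the lowest count avoids picking one of them arbitrarily.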
Solution 3
Here's one way to do this, using a Counter dictionary. It prints the rare characters along with their number of occurrences. We define a rare character to be one whose number of occurrences is at or below a certain threshold: the mean number of occurrences multiplied by a weighting factor, which I've set to 0.5 in this example.
from collections import Counter

# fname is the path to the text file
with open(fname, 'r') as f:
    text = f.read()

counter = Counter(text)
mean = len(text) / len(counter)
print('Mean:', mean)

weight = 0.5
thresh = mean * weight
print('Threshold:', thresh)

# Only print results for chars whose occurrence count is at or below the threshold
for ch, count in reversed(counter.most_common()):
    if count <= thresh:
        print('{0!r}: {1}'.format(ch, count))
    else:
        break
If this is an actual text file, you may wish to filter out certain characters, e.g. newlines and spaces.
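A minimal sketch of that filtering, assuming whitespace is all you want to drop: pass Counter a generator that skips characters for which str.isspace() is true. The sample string is made up and stands in for the file's contents.

```python
from collections import Counter

# Sample text, assumed here in place of a file's contents.
text = "a b\nc a\tb a\n"

# Count only characters that are not whitespace (spaces, tabs, newlines).
counter = Counter(ch for ch in text if not ch.isspace())
print(counter.most_common())
```

The same pattern works for any other test, for example skipping punctuation or digits.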
Solution 4
Using the collections module, you can access the n least common elements with c.most_common()[:-n-1:-1]
from collections import Counter
c = Counter("sadaffdsagfgdfaafsasdfs3213jlkjk22jl31j2k13j313j13")
res = c.most_common()[:-3-1:-1]
print("The 3 Rarest characters are:", res[0][0], ",", res[1][0], "and", res[2][0])
Result:
The 3 Rarest characters are: l , g and k
Solution 5
To find the 10 rarest characters in a text:
from collections import Counter
rarest_chars = Counter(text).most_common()[-10:]
"Character" here means a Unicode code point, for simplicity. That means "a" and "A" are considered different characters, and u'g̈' (U+0067 U+0308) is considered two characters. See how these issues are handled in a related question: Most common character in a string.
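If those two cases matter, a rough approximation is possible with only the standard library: case-fold the text so that "a" and "A" merge, and attach each combining mark to the character before it. This is a sketch assuming Python 3, not full Unicode grapheme segmentation; the helper name approx_graphemes and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter

def approx_graphemes(text):
    """Yield base characters with any following combining marks attached."""
    cluster = ""
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

sample = "Ag\u0308a"  # "A", then "g" + combining diaeresis, then "a"
counts = Counter(approx_graphemes(sample.casefold()))
print(counts.most_common())
```

Here "A" and "a" are counted together, and the "g" plus U+0308 pair is counted once as a single unit.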
counter.most_common()[-10:] could be written more efficiently using heapq.nsmallest(10, counter.items(), key=itemgetter(1)): .items() returns pairs (character, its_count), and key=itemgetter(1) extracts the counts, so that the 10 pairs with the smallest counts are returned.
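Spelled out with its imports, the heapq variant might look like the sketch below. The sample string and the choice of 2 instead of 10 are only there to keep the example small.

```python
import heapq
from collections import Counter
from operator import itemgetter

text = "aaaabbbccd"  # sample text, assumed for illustration
counter = Counter(text)

# The 2 (character, count) pairs with the smallest counts, smallest first.
rarest = heapq.nsmallest(2, counter.items(), key=itemgetter(1))
print(rarest)  # [('d', 1), ('c', 2)]
```

Note that nsmallest returns the pairs in ascending order of count, whereas the slice of most_common() lists them in descending order.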
EM90
Updated on June 17, 2022

Comments
- EM90 (almost 2 years ago): Assume I have a huge .txt file full of random characters and I want to find out the "rare ones". Is there some module (something at all, actually) in Python (possibly version 3.x, but I also have a machine using Python 2.7, in case it's better) written for this purpose? In case of a positive answer, where can I find some basic explanation of its functioning? Thank you very much.
- Duncan (about 8 years ago): The OP asked for rare, not most common.
- Paul Cornelius (about 8 years ago): @Duncan just reverse the list.
- Markus Meskanen (about 8 years ago): @Duncan They're equal, just different order.
- Duncan (about 8 years ago): I know that, just think the answer is incomplete unless it actually says it.
- Luka Rahne (about 8 years ago): why sorting sorted list?
- Rolf of Saxony (about 8 years ago): I'm sorting on the numerical value of the occurrence rather than the alphabetical character.
- jfs (about 8 years ago): @Duncan: the rarest char is: c.most_common()[-1].
- jfs (about 8 years ago): 1- .most_common() already returns the pairs sorted by the number of occurrences, e.g., .most_common()[-1] is the rarest character; no need to call an additional sort() to get the 2 rarest characters. 2- You could use operator.itemgetter(1) instead of defining the occur() function. 3- All your strings are bytestrings. You should use Unicode when handling text. 4- heapq.nsmallest() can be more efficient than calling .most_common().
- jfs (about 8 years ago): I meant to say: "strings are bytestrings" in your code and you should use Unicode strings instead.
- Rolf of Saxony (about 8 years ago): @J.F.Sebastian Altered to fully utilise the functionality of the collections module.