Find rare characters with Python

14,589

Solution 1

from collections import Counter

c = Counter("text")
print(c.most_common())

output

[('t', 2), ('e', 1), ('x', 1)]
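
Since the question asks for rare characters rather than common ones, it may help to spell out that most_common() sorts from most to least frequent, so the rarest entries sit at the end of the list. A minimal sketch, reusing the same sample string (the name rarest_first is only illustrative):

```python
from collections import Counter

c = Counter("text")

# most_common() is ordered from most to least frequent,
# so reversing it lists the rarest characters first.
rarest_first = c.most_common()[::-1]
print(rarest_first)

# The last entry has the lowest count; when several characters tie,
# which of them comes last is an implementation detail.
print(c.most_common()[-1])
```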

Solution 2

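d = {}
# Use a context manager so the file is closed once it has been read
with open(filename, "r") as f:
    for c in f.read():
        # Start each new character at 0, then add one per occurrence
        d[c] = d.get(c, 0) + 1

print(d)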

You can then search d for the characters with the lowest counts.
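
One possible way to do that search, sketched with made-up counts standing in for the dictionary d built above (the sample values and the names lowest and rare are assumptions for illustration):

```python
# Hypothetical sample counts, standing in for the dictionary d built above.
d = {"a": 5, "b": 1, "c": 3, "z": 1}

# Find the smallest count, then collect every character that has it.
lowest = min(d.values())
rare = sorted(ch for ch, n in d.items() if n == lowest)
print(lowest, rare)
```

Collecting all characters that share the minimum avoids silently dropping ties, which min(d, key=d.get) alone would do.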

Solution 3

Here's one way to do this, using a Counter dictionary. It prints the rare characters along with their number of occurrences. A character is considered rare if its count does not exceed a threshold, defined as the mean number of occurrences multiplied by a weighting factor, which I've set to 0.5 in this example.

from collections import Counter

with open(fname, 'r') as f:
    text = f.read()

counter = Counter(text)
mean = len(text) / len(counter)
print('Mean:', mean)

weight = 0.5
thresh = mean * weight
print('Threshold:', thresh)

# Only print chars whose occurrence count is at or below the threshold.
# reversed(most_common()) runs from rarest to most common, so we can stop early.
for ch, count in reversed(counter.most_common()):
    if count <= thresh:
        print('{0!r}: {1}'.format(ch, count))
    else:
        break

If this is an actual text file, you may wish to filter out certain characters, e.g. newlines and spaces.
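
A minimal sketch of such filtering, assuming that "certain characters" means all whitespace; the sample string is made up and stands in for the file contents:

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of the file.
text = "aa bb\ncc d\n"

# Count only non-whitespace characters; str.isspace() covers spaces,
# tabs and newlines.
counter = Counter(ch for ch in text if not ch.isspace())
print(counter)
```

If you filter this way, the mean should be computed from sum(counter.values()) rather than len(text), since len(text) still includes the characters that were skipped.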

Solution 4

Using collections.Counter, you can get the n least common elements with c.most_common()[:-n-1:-1]:

from collections import Counter
c = Counter("sadaffdsagfgdfaafsasdfs3213jlkjk22jl31j2k13j313j13")
res = c.most_common()[:-3-1:-1]
# print() as a function works on Python 3 (the original used the Python 2 print statement)
print("The 3 Rarest characters are:", res[0][0], ",", res[1][0], "and", res[2][0])

Result:

The 3 Rarest characters are: l , g and k

Solution 5

To find the 10 rarest characters in a text:

from collections import Counter

rarest_chars = Counter(text).most_common()[-10:]

Here, "character" means a Unicode code point, for simplicity: "a" and "A" are counted as different characters, and u'g̈' (U+0067 U+0308) is counted as two characters. See how these issues are handled in a related question: Most common character in a string.

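As a rough illustration of the code point caveat, the sketch below folds case and applies NFC normalization before counting. This is only one possible approach and an assumption on my part, not the method from the related question; the sample string is made up, and sequences with no precomposed form (such as g + U+0308) still count as several code points.

```python
import unicodedata
from collections import Counter

# Hypothetical sample: "é" written once precomposed and once as e + U+0301.
text = "A a \u00e9 e\u0301"

# NFC composes sequences that have a precomposed form; casefold() removes
# case distinctions. Neither step merges full grapheme clusters.
normalized = unicodedata.normalize("NFC", text).casefold()
counter = Counter(ch for ch in normalized if not ch.isspace())
print(counter)
```
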
counter.most_common()[-10:] could be written more efficiently as heapq.nsmallest(10, counter.items(), key=itemgetter(1)): .items() returns (character, count) pairs, and key=itemgetter(1) extracts the counts, so the 10 pairs with the lowest counts are returned.
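
The heapq variant can be sketched as a complete snippet like this; the sample string and the choice of 2 instead of 10 are only there to keep the example small:

```python
import heapq
from collections import Counter
from operator import itemgetter

# Hypothetical sample text; in practice this would be the file contents.
text = "aaaabbbccd"

counter = Counter(text)
# Pick the 2 pairs with the smallest counts without sorting everything.
rarest = heapq.nsmallest(2, counter.items(), key=itemgetter(1))
print(rarest)
```

Note that nsmallest() returns the rarest pair first, whereas most_common()[-10:] puts the rarest pair last.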

Author by

EM90

Updated on June 17, 2022

Comments

  • EM90
    EM90 almost 2 years

    Assume I have a huge .txt file full of random characters and I want to find out the "rare ones". Is there some module (something at all, actually) in Python (possibly, version 3.x, but I have also a machine using Python 2.7, in case it's better) written for this purpose? In case of positive answer, where can I find some basic explanation of its functioning? Thank you very much.

  • Duncan
    Duncan about 8 years
    The OP asked for rare, not most common.
  • Paul Cornelius
    Paul Cornelius about 8 years
    @Duncan just reverse the list.
  • Markus Meskanen
    Markus Meskanen about 8 years
    @Duncan They're equal, just different order.
  • Duncan
    Duncan about 8 years
    I know that, just think the answer is incomplete unless it actually says it.
  • Luka Rahne
    Luka Rahne about 8 years
    why sorting sorted list?
  • Rolf of Saxony
    Rolf of Saxony about 8 years
    I'm sorting on the numerical value of the occurrence rather than the alphabetical character.
  • jfs
    jfs about 8 years
    @Duncan: the rarest char is: c.most_common()[-1].
  • jfs
    jfs about 8 years
    1- .most_common() already returns the pairs sorted by the number of occurrences e.g., .most_common()[-1] is the rarest character -- no need to call additional sort(), to get 2 rarest characters. 2- you could use operator.itemgetter(1) instead of defining occur() function 3- all your strings are bytestrings. You should use Unicode when handling text. 4- heapq.nsmallest() can be more efficient than calling .most_common()
  • jfs
    jfs about 8 years
    I meant to say: "strings are bytestrings" in your code and you should use Unicode strings instead.
  • Rolf of Saxony
    Rolf of Saxony about 8 years
    @J.F.Sebastian Altered to fully utilise the functionality with the collections module