Find rare characters with Python

python algorithm python-2.7 python-3.x

14,589

Solution 1

from collections import Counter

c = Counter("text")
print(c.most_common())

output

[('t', 2), ('e', 1), ('x', 1)]

Solution 2

d = {}
for c in open(filename, "r").read():
    if c in d:
        d[c] += 1
    else:
        d[c] = 1

print(d)

Then you can use d to search for the minimum letters.

Solution 3

Here's one way to do this, using a Counter dictionary. It prints the rare characters, along with their number of occurrences. We define a rare character to be one whose number of occurrences is less than a certain threshold, which is the mean number of occurrences multiplied by a weighting factor, which I've set to 0.5 in this example.

from collections import Counter

with open(fname, 'r') as f:
    text = f.read()

counter = Counter(text)
mean = len(text) / len(counter)
print('Mean:', mean)

weight = 0.5
thresh = mean * weight
print('Threshold:', thresh)

#Only print results for chars whose occurence is less than the threshold
for ch, count in reversed(counter.most_common()):
    if count <= thresh:
        print('{0!r}: {1}'.format(ch, count))
    else:
        break

If this is an actual text file you may wish to filter out certain characters, eg newlines and spaces.

Solution 4

Using the collections option to access the n least common elements c.most_common()[:-n-1:-1]

from collections import Counter
c = Counter("sadaffdsagfgdfaafsasdfs3213jlkjk22jl31j2k13j313j13")
res = c.most_common()[:-3-1:-1]
print "The 3 Rarest characters are:",res[0][0],",",res[1][0],"and",res[2][0]

Result:

The 3 Rarest characters are: l , g and k

Solution 5

To find 10 rarest characters in a text:

from collections import Counter

rarest_chars = Counter(text).most_common()[-10:]

"character" means a Unicode codepoint here for simplicity: It means "a" and "A" are considered as different characters. It means u'g̈' (U+0067 U+0308) is considered as two characters. See how these issues are handled in a related question: Most common character in a string.

counter.most_common()[-10:] could be written more efficiently using heapq.nsmallest(10, counter.items(), key=itemgetter(1)) : .items() returns pairs (character, its_count) and key=itemgetter(1) extracts the counts so that 10 pairs with the least counts are returned.

View more solutions

14,589

Author by

EM90

Updated on June 17, 2022

Comments

EM90 almost 2 years

Assume I have a huge .txt file full of random characters and I want to find out the "rare ones". Is there some module (something at all, actually) in Python (possibly, version 3.x, but I have also a machine using Python 2.7, in case it's better) written for this purpose? In case of positive answer, where can I find some basic explanation of its functioning? Thank you very much.
Duncan about 8 years

The OP asked for rare, not most common.
Paul Cornelius about 8 years

@Duncan just reverse the list.
Markus Meskanen about 8 years

@Duncan They're equal, just different order.
Duncan about 8 years

I know that, just think the answer is incomplete unless it actually says it.
Luka Rahne about 8 years

why sorting sorted list?
Rolf of Saxony about 8 years

I'm sorting on the numerical value of the occurrence rather than the alphabetical character.
jfs about 8 years

@Duncan: the rarest char is: c.most_common()[-1].
jfs about 8 years

1- .most_common() already returns the pairs sorted by the number of occurrences e.g., .most_common()[-1] is the rarest character -- no need to call additional sort(), to get 2 rarest characters. 2- you could use operator.itemgetter(1) instead of defining occur() function 3- all your strings are bytestrings. You should use Unicode when handling text. 4- heapq.nsmallest() can be more efficient than calling `.most_common()
jfs about 8 years

I meant to say: "strings are bytestrings" in your code and you should use Unicode strings instead.
Rolf of Saxony about 8 years

@J.F.Sebastian Altered to fully utilise the functionality with the collections module