How to maintain a dictionary in a heap in Python?
Solution 1
Using heapq
You probably want to do something like this:
import heapq

heap = [(-value, key) for key, value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]
Note that since heapq implements only a min-heap, it's better to invert the values so that larger values become smaller.
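A minimal sketch of the inversion trick on a small sample dict (the dict contents here are illustrative, not from the question):

```python
import heapq

the_dict = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}

# Negate the values so the min-heap ordering surfaces the largest originals
heap = [(-value, key) for key, value in the_dict.items()]
top = heapq.nsmallest(3, heap)
top = [(key, -value) for value, key in top]
print(top)  # [('def', 250), ('xyz', 200), ('abc', 100)]
```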
For small dictionaries the difference between the two approaches is small, for example:
>>> import heapq
>>> import random
>>> import itertools as it
>>> def key_generator():
... characters = [chr(random.randint(65, 90)) for x in range(100)]
... for i in it.count():
... yield ''.join(random.sample(characters, 3))
...
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
... items = [(-value, key) for key, value in the_dict.items()]
... smallest = heapq.nsmallest(10, items)
... return [-value for value, key in smallest]
...
>>> def with_sorted(the_dict):
... return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
...
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902
With 3000 values the heapq version is just slightly faster than the sorted version, which is O(n log n) instead of O(n + m log n). If we increase the size of the dict to 10000, the heapq version becomes even faster:
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549
The timings probably also depend on the machine you are running on, so you should profile which solution works best in your case. If efficiency is not critical, I'd suggest using the sorted version because it's simpler.
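As a side note, heapq.nlargest accepts a key function, which avoids the sign inversion entirely; a short sketch with an illustrative sample dict:

```python
import heapq
from operator import itemgetter

the_dict = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}

# key= tells nlargest to rank items by their value (second tuple element)
top = heapq.nlargest(3, the_dict.items(), key=itemgetter(1))
print(top)  # [('def', 250), ('xyz', 200), ('abc', 100)]
```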
Solution 2
Using a heap is the best solution, with time complexity O(n log k), where n is the number of items in the dictionary and k is 10 here.
Now the trick with mapping keys is to create a wrapper class for comparison that defines the magic methods __lt__() and __gt__(), which override the < and > operators.
import heapq

class CompareWord:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):  # overrides the < operator
        return self.value < other.value

    def __gt__(self, other):  # overrides the > operator
        return self.value > other.value

    def getWord(self):
        return self.word

def findKGreaterValues(compare_dict, k):
    min_heap = []
    for word in compare_dict:
        heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
        if len(min_heap) > k:
            heapq.heappop(min_heap)

    answer = []
    for compare_word_obj in min_heap:
        answer.append(compare_word_obj.getWord())
    return answer
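A self-contained run of this bounded-heap approach, restating a trimmed version of the class for completeness (the sample dict and k=2 are illustrative):

```python
import heapq

class CompareWord:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):  # min-heap orders by value via <
        return self.value < other.value

counts = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}

min_heap = []
for word, value in counts.items():
    heapq.heappush(min_heap, CompareWord(word, value))
    if len(min_heap) > 2:
        heapq.heappop(min_heap)  # drop the current smallest, keeping the top 2

# order inside the heap list is arbitrary, so sort for display
top_two = sorted(obj.word for obj in min_heap)
print(top_two)  # ['def', 'xyz']
```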
Solution 3
For getting the top 10 elements, assuming that the number is in the second position of each item:
from operator import itemgetter
topten = sorted(mydict.items(), key=itemgetter(1), reverse = True)[0:10]
If you want to sort by value and then by key, just change it to key=itemgetter(1, 0).
As for a data structure, a heap sounds like what you would want. Just keep them as tuples, and compare the number term.
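A sketch of the tuple-based heap idea above: tuples compare element-wise, so (count, name) pairs order by count first, and a min-heap of them can be trimmed down to the k largest (sample data illustrative):

```python
import heapq

mydict = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}

# (count, name) tuples compare by count first, then name on ties
heap = [(value, key) for key, value in mydict.items()]
heapq.heapify(heap)
while len(heap) > 2:       # pop the smallest until only the 2 largest remain
    heapq.heappop(heap)
print(sorted(heap, reverse=True))  # [(250, 'def'), (200, 'xyz')]
```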
Solution 4
You can implement the __lt__ method in your class, specifying which attribute should be compared. It must return a boolean:
def __lt__(self, other):
    return self.attribute < other.attribute
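A minimal sketch of such a class (the Entry name and its attributes are illustrative); functools.total_ordering fills in the remaining comparison operators from __lt__ and __eq__:

```python
from functools import total_ordering

@total_ordering
class Entry:
    def __init__(self, name, count):
        self.name = name
        self.count = count

    def __eq__(self, other):
        return self.count == other.count

    def __lt__(self, other):  # compare on the count attribute only
        return self.count < other.count

assert Entry('abc', 100) < Entry('xyz', 200)
assert Entry('abc', 100) <= Entry('def', 100)  # <= derived by total_ordering
```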
Solution 5
If the dictionary remains constant, then instead of creating a heap directly or through collections.Counter, you can sort the dictionary items by value in reverse order and take the first 10. Note that the result is a list of tuples; if you need a dictionary again, you have to rebuild it from those tuples.
>>> import operator, random, string
>>> some_dict = {string.ascii_lowercase[random.randint(0,23):][:3]: random.randint(100,300) for _ in range(100)}
>>> some_dict
{'cde': 262, 'vwx': 175, 'xyz': 163, 'uvw': 288, 'qrs': 121, 'mno': 192, 'ijk': 103, 'abc': 212, 'wxy': 206, 'efg': 256, 'opq': 255, 'tuv': 128, 'jkl': 158, 'pqr': 291, 'fgh': 191, 'lmn': 259, 'rst': 140, 'hij': 192, 'nop': 202, 'bcd': 258, 'klm': 145, 'stu': 293, 'ghi': 264, 'def': 260}
>>> sorted(some_dict.items(), key = operator.itemgetter(1), reverse = True)[:10]
[('stu', 293), ('pqr', 291), ('uvw', 288), ('ghi', 264), ('cde', 262), ('def', 260), ('lmn', 259), ('bcd', 258), ('efg', 256), ('opq', 255)]
If you are using heapq, creating the heap takes O(n log n) operations if you build it by pushing elements one at a time, or O(n) if you heapify a list, followed by O(m log n) operations to fetch the top m elements.
If you are sorting the items, Python's sorting algorithm (Timsort) is guaranteed to be O(n log n) in the worst case, and fetching the first 10 elements is a constant-time slice.
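Since this is a top-10-by-count problem, collections.Counter (mentioned above) also covers it directly; its most_common method does the heap bookkeeping for you (sample data illustrative):

```python
from collections import Counter

counts = Counter({'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50})
print(counts.most_common(2))  # [('def', 250), ('xyz', 200)]
```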
gizgok
Updated on July 09, 2022

Comments
-
gizgok almost 2 years
I have a dictionary as follows:
{'abc':100,'xyz':200,'def':250 .............}
It is a dictionary whose keys are entity names and whose values are the counts of those entities. I need to return the top 10 elements from the dictionary.
I can write a heap to do it, but I'm not sure how to do the value-to-key mapping, as certain values will be equal.
Is there any other data structure to do this?
-
dawg about 11 years: Does 'top 10 elements' mean the greatest 10 values?
-
gizgok about 11 years: Yes, it means the greatest values.
-
Bakuriu about 11 years: How should the duplicates be handled? I mean, should we simply ignore them as if they were different values, or should we keep only one of the duplicated values?
-
Bakuriu about 11 years: Actually building the heap is O(n) and not O(n log n). From the documentation: "heapify(x) # transforms list into a heap, in-place, in linear time"
-
Paul Hankin about 11 years: +1, but the call to heapify is redundant and in fact heapq.nlargest(10, the_dict.items(), key=lambda x: x[1]) is all you need.
gizgok about 11 years: In the worst case quickselect can go up to O(n^2), though that does not happen often unless the pivot selected is the max element. I had median-of-medians and quickselect implemented, and I also wanted to use a heap for comparison. To avoid quickselect's worst case, you can take about 3 random numbers from the list and use their median as the pivot.