How to maintain a dictionary in a heap in Python?

23,209

Solution 1

Using heapq you probably want to do something like this:

import heapq

heap = [(-value, key) for key, value in the_dict.items()]
largest = heapq.nsmallest(10, heap)
largest = [(key, -value) for value, key in largest]

Note that since heapq implements only a min-heap, it's better to invert the values so that larger values become smaller.

For small dictionaries this solution may even be slower than plain sorting, for example:

>>> import heapq
>>> import random
>>> import itertools as it
>>> def key_generator():
...     characters = [chr(random.randint(65, 90)) for x in range(100)]
...     for i in it.count():
...             yield ''.join(random.sample(characters, 3))
... 
>>> the_dict = dict((key, random.randint(-500, 500)) for key, _ in zip(key_generator(), range(3000)))
>>> def with_heapq(the_dict):
...     items = [(-value, key) for key, value in the_dict.items()]
...     smallest = heapq.nsmallest(10, items)
...     return [-value for value, key in smallest]
... 
>>> def with_sorted(the_dict):
...     return sorted(the_dict.items(), key=(lambda x: x[1]), reverse=True)[:10]
... 
>>> import timeit
>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
0.9220538139343262
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
1.2792410850524902

With 3000 values it's only slightly faster than the sorted version, which is O(n log n), versus roughly O(n + m log n) for the heap-based approach. If we increase the size of the dict to 10000, the heapq version becomes faster still:

>>> timeit.timeit('with_heapq(the_dict)', 'from __main__ import the_dict, with_heapq', number=1000)
2.436316967010498
>>> timeit.timeit('with_sorted(the_dict)', 'from __main__ import the_dict, with_sorted', number=1000)
3.585728168487549

The timings also depend on the machine you are running on, so you should profile both solutions to see which works best in your case. If efficiency is not critical, I'd suggest using the sorted version because it's simpler.
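As a further simplification (my own suggestion, not part of the original answer): heapq.nlargest accepts a key function, which avoids negating the values by hand:

```python
import heapq
from operator import itemgetter

the_dict = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50}

# nlargest with a key function returns the items with the largest
# values directly, so no manual negation is needed.
top = heapq.nlargest(3, the_dict.items(), key=itemgetter(1))
print(top)  # [('def', 250), ('xyz', 200), ('abc', 100)]
```

This keeps the original (key, value) pairs intact in the result, which is usually what you want anyway.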

Solution 2

Using a heap is a good solution, with time complexity O(n log k), where n is the number of items and k is 10 here.

The trick for mapping keys to values is to create a wrapper class for comparison and define the magic methods __lt__() and __gt__(), which override the < and > operators:

import heapq

class CompareWord:
  def __init__(self, word, value):
    self.word = word
    self.value = value

  def __lt__(self, other):   # overrides the < operator
    return self.value < other.value

  def __gt__(self, other):   # overrides the > operator
    return self.value > other.value

  def getWord(self):
    return self.word

def findKGreaterValues(compare_dict, k):
  min_heap = []
  for word in compare_dict:
      heapq.heappush(min_heap, CompareWord(word, compare_dict[word]))
      if len(min_heap) > k:
          heapq.heappop(min_heap)   # drop the smallest, keeping the k largest
  answer = []
  for compare_word_obj in min_heap:
      answer.append(compare_word_obj.getWord())

  return answer
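The same bounded-heap scan can be written without a wrapper class by pushing (value, word) tuples, since tuples compare element-wise and the value comes first. A sketch (not part of the original answer; ties on value fall back to comparing the words):

```python
import heapq

def find_k_greater_values(compare_dict, k):
    # Keep a min-heap of at most k (value, word) tuples; the smallest
    # value is always at the root, so it is the one discarded.
    min_heap = []
    for word, value in compare_dict.items():
        if len(min_heap) < k:
            heapq.heappush(min_heap, (value, word))
        else:
            # heappushpop pushes the new item, then pops the smallest.
            heapq.heappushpop(min_heap, (value, word))
    # Sort descending by value for a readable result.
    return [word for value, word in sorted(min_heap, reverse=True)]

counts = {'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50, 'jkl': 300}
print(find_k_greater_values(counts, 3))  # ['jkl', 'def', 'xyz']
```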

Solution 3

For getting the top 10 elements, assuming that the number is in the second place:

from operator import itemgetter

topten = sorted(mydict.items(), key=itemgetter(1), reverse=True)[:10]

If you want to sort by value and then by key, just change it to key=itemgetter(1, 0).

As for a data structure, a heap sounds like what you would want. Just keep them as tuples, and compare the number term.

Solution 4

You can implement the __lt__ method in your class, where you specify which attributes should be compared:

def __lt__(self, other):
   return self.attribute < other.attribute
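For example, a minimal sketch (the class and attribute names are my own, assuming a single comparable attribute): once __lt__ is defined, both sorted and heapq work on the objects directly.

```python
import heapq

class Entry:
    def __init__(self, word, value):
        self.word = word
        self.value = value

    def __lt__(self, other):
        # heapq and sorted only need __lt__ to order objects.
        return self.value < other.value

entries = [Entry('abc', 100), Entry('xyz', 200), Entry('def', 250)]
top = heapq.nlargest(2, entries)
print([e.word for e in top])  # ['def', 'xyz']
```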


Solution 5

If the dictionary remains constant, then instead of building a heap directly or through collections.Counter, you can sort the dictionary items by value in reverse order and take the first 10 elements. You can recreate a dictionary from the resulting tuples if needed.

>>> import operator, random, string
>>> some_dict = {string.ascii_lowercase[random.randint(0,23):][:3]:random.randint(100,300) for _ in range(100)}
>>> some_dict
{'cde': 262, 'vwx': 175, 'xyz': 163, 'uvw': 288, 'qrs': 121, 'mno': 192, 'ijk': 103, 'abc': 212, 'wxy': 206, 'efg': 256, 'opq': 255, 'tuv': 128, 'jkl': 158, 'pqr': 291, 'fgh': 191, 'lmn': 259, 'rst': 140, 'hij': 192, 'nop': 202, 'bcd': 258, 'klm': 145, 'stu': 293, 'ghi': 264, 'def': 260}
>>> sorted(some_dict.items(), key = operator.itemgetter(1), reverse = True)[:10]
[('stu', 293), ('pqr', 291), ('uvw', 288), ('ghi', 264), ('cde', 262), ('def', 260), ('lmn', 259), ('bcd', 258), ('efg', 256), ('opq', 255)]

If you are using heapq, building the heap takes O(n log n) operations if you insert the elements one at a time, or O(n) if you heapify a list, followed by O(m log n) operations to fetch the top m elements.

If you are sorting the items, Python's sorting algorithm (Timsort) is guaranteed to be O(n log n) in the worst case, and fetching the first 10 elements is a constant-time slice.
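Since the answer mentions collections.Counter: for count dictionaries like this one, Counter.most_common(k) does exactly the top-k query (in CPython it delegates to heapq.nlargest when k is given). A sketch:

```python
from collections import Counter

counts = Counter({'abc': 100, 'xyz': 200, 'def': 250, 'ghi': 50})

# most_common(k) returns the k (key, count) pairs with the largest counts.
print(counts.most_common(2))  # [('def', 250), ('xyz', 200)]
```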


Author: gizgok

Updated on July 09, 2022

Comments

  • gizgok
    gizgok almost 2 years

    I have a dictionary as follows:

    {'abc':100,'xyz':200,'def':250 .............}
    

It is a dictionary with keys that are names of entities and values that are counts of those entities. I need to return the top 10 elements from the dictionary.

    I can write a heap to do it, but I'm not sure how to do value to key mapping as certain values will be equal.

    Is there any other data structure to do this?

    • dawg
      dawg about 11 years
      Does 'top 10 elements' mean the greatest 10 values?
    • gizgok
      gizgok about 11 years
      Yes it means greatest values
    • Bakuriu
      Bakuriu about 11 years
      How should the duplicates be handled? I mean, should we simply ignore them as if they were different values, or should we keep only one of the duplicated values?
  • Bakuriu
    Bakuriu about 11 years
    Actually building the heap is O(n) and not O(nlogn). From the documentation: "heapify(x) # transforms list into a heap, in-place, in linear time"
  • Paul Hankin
    Paul Hankin about 11 years
+1, but the call to heapify is redundant and in fact heapq.nlargest(10, the_dict.items()) is all you need.
  • gizgok
    gizgok about 11 years
In the worst case quickselect can go to O(n^2), though that rarely happens unless the pivot selected is the max element. I had median-of-medians and quickselect implemented, and I also wanted a heap for comparison. To guard against quickselect's worst case, you can take about 3 random numbers from the list and use their median as the pivot.