Python "set" with duplicate/repeated elements

44,848

Solution 1

You are looking for a multiset.

Python's closest datatype is collections.Counter:

A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

For an actual implementation of a multiset, use the bag class from the data-structures package on pypi. Note that this is for Python 3 only. If you need Python 2, here is a recipe for a bag written for Python 2.4.

Solution 2

Your approach with dict with element/count seems ok to me. You probably need some more functionality. Have a look at collections.Counter.

  • O(1) test whether an element is present and current count retrieval (faster than with element in list and list.count(element))
  • counter.elements() looks like a list with all duplicates
  • easy manipulation union/difference with other Counters

Solution 3

Python "set" with duplicate/repeated elements

This depends on how you define a set. One may assume that to the OP

  1. order does not matter (whether ordered or unordered)
  2. replicates/repeated elements (a.k.a. multiplicities) are permitted

Given these assumptions, the options reduce to two abstract types: a list or a multiset. In Python, these type usually translate to a list and Counter respectively. See the Details on some subtleties to observe.

Given

import random

import collections as ct

random.seed(123)


elems = [random.randint(1, 11) for _ in range(10)]
elems
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]

Code

A list of replicate elements:

list(elems)
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]

A "multiset" of replicate elements:

ct.Counter(elems)
# Counter({1: 2, 5: 2, 2: 2, 7: 2, 9: 2})

Details

On Data Structures

We have a mix of terms here that easily get confused. To clarify, here are some basic mathematical data structures compared to ones in Python.

Type        |Abbr|Order|Replicates|   Math*   |   Python    | Implementation
------------|----|-----|----------|-----------|-------------|----------------
Set         |Set |  n  |     n    | {2  3  1} |  {2, 3, 1}  | set(el)
Ordered Set |Oset|  y  |     n    | {1, 2, 3} |      -      | list(dict.fromkeys(el)
Multiset    |Mset|  n  |     y    | [2  1  2] |      -      | <see `mset` below>
List        |List|  y  |     y    | [1, 2, 2] |  [1, 2, 2]  | list(el)

From the table, one can deduce the definition of each type. Example: a set is a container that ignores order and rejects replicate elements. In contrast, a list is a container that preserves order and permits replicate elements.

Also from the table, we can see:

  • Both an ordered set and a multiset are not explicitly implemented in Python
  • "Order" is a contrary term to a random arrangement of elements, e.g. sorted or insertion order
  • Sets and multisets are not strictly ordered. They can be ordered, but order does not matter.
  • Multisets permit replicates, thus they are not strict sets (the term "set" is indeed confusing).

On Multisets

Some may argue that collections.Counter is a multiset. You are safe in many cases to treat it as such, but be aware that Counter is simply a dict (a mapping) of key-multiplicity pairs. It is a map of multiplicities. See an example of elements in a flattened multiset:

mset = [x for k, v in ct.Counter(elems).items() for x in [k]*v]
mset
# [1, 1, 5, 5, 2, 2, 7, 7, 9, 9]

Notice there is some residual ordering, which may be surprising if you expect disordered results. However, disorder does not preclude order. Thus while you can generate a multiset from a Counter, be aware of the following provisos on residual ordering in Python:

  • replicates get grouped together in the mapping, introducing some degree of order
  • in Python 3.6, dict's preserve insertion order

Summary

In Python, a multiset can be translated to a map of multiplicities, i.e. a Counter, which is not randomly unordered like a pure set. There can be some residual ordering, which in most cases is ok since order does not generally matter in multisets.

See Also

*Mathematically, (according to N. Wildberger, we express braces {} to imply a set and brackets [] to imply a list, as seen in Python. Unlike Python, commas , to imply order.

Share:
44,848

Related videos on Youtube

cammil
Author by

cammil

Updated on July 09, 2022

Comments

  • cammil
    cammil almost 2 years

    Is there a standard way to represent a "set" that can contain duplicate elements.

    As I understand it, a set has exactly one or zero of an element. I want functionality to have any number.

    I am currently using a dictionary with elements as keys, and quantity as values, but this seems wrong for many reasons.

    Motivation: I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by: survey = ['blue', 'red', 'blue', 'green']

    Here, I do not care about the order, but I do about quantities. I want to do things like:

    survey.add('blue')
    # would give survey == ['blue', 'red', 'blue', 'green', 'blue']
    

    ...and maybe even

    survey.remove('blue')
    # would give survey == ['blue', 'red', 'green']
    

    Notes: Yes, set is not the correct term for this kind of collection. Is there a more correct one?

    A list of course would work, but the collection required is unordered. Not to mention that the method naming for sets seems to me to be more appropriate.

    • jamylak
      jamylak about 12 years
      It might help by explaining why you want to do this.
    • g.d.d.c
      g.d.d.c about 12 years
      If you need duplicates it's not a set by definition. Can you demonstrate what you think you want, and maybe we can suggest an appropriate container or data type?
    • jdi
      jdi about 12 years
      This is a contradictory request unless you clarify your intent. You could technically define a custom hash method for your objects to allow duplicates in a set or dict but then it would be up to you to count them another way. I dont think you really want duplicate members if your goal is to count. Dict with count value doesnt seem wrong for many reasons.
    • Akavall
      Akavall about 12 years
      If order is not important to you, the fact that list is ordered should not be a problem for you. Is there any reason you need to randomize the order?
    • max
      max about 12 years
      If the OP rewords his question, it would be much better. It's enough to ask for a structure that "supports multiple identical elements"; asking for different method names and lack of order isn't reasonable. List isn't a good solution since it wastes memory when each element is repeated many times, and wastes time on insert/delete compared to multiset.
    • cammil
      cammil about 12 years
      The beauty of stack overflow is that sometimes you don't know what you don't know. If I knew what I "should" have been asking, I may have been able to google it. The ability for a human to understand what is needed is why I eventually ask stack overflow after fruitless time spent googling. I hope this backlash towards undereducated people seeking help doesnt prevail over common decency. JSpolsky talked of the importance of the community here, and I can't help but feel something is going awry.
    • modulitos
      modulitos almost 10 years
      Under "Notes", I believe the term that the OP is looking for is bag (a common term for multiset)
  • max
    max about 12 years
    What's the difference between collections.Counter and pypi's bag?
  • Zen
    Zen almost 10 years
    On python 2.7.6 I can run bag, why?
  • Phani
    Phani over 9 years
    One big gotcha here: len(counter_obj) gives you the number of unique elements but not the total number of elements as you expect from a multiset. But, you can do all other operations like unions and intersections just as you do with sets.
  • Steven Rumbalski
    Steven Rumbalski almost 9 years
    @Phani: Good point. For total number of elements do sum(counter_obj.values()).
  • ComputerFellow
    ComputerFellow over 7 years
    It's most likely that the OP was looking for a multiset, and transforming a list to a set loses duplicates.
  • LD Jewell
    LD Jewell over 7 years
    I posted this answer before it was edited. My approach is only use the set as a view of the original list.