When should I use the HashSet<T> type?


Solution 1

The important thing about HashSet<T> is right there in the name: it's a set. The only things you can do with a single set are to establish what its members are and to check whether an item is a member.

Asking whether you can retrieve a single element (e.g. set[45]) misunderstands the concept of a set. There's no such thing as the 45th element of a set. Items in a set have no ordering. The sets {1, 2, 3} and {2, 3, 1} are identical in every respect because they have the same membership, and membership is all that matters.
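
The point about membership can be illustrated with a minimal sketch. The class and method names (SetEqualityDemo, SameMembers) are made up for this illustration; HashSet<T>.SetEquals and Contains are the only framework calls used.

```csharp
using System;
using System.Collections.Generic;

public static class SetEqualityDemo
{
    // Returns true when both sequences contain the same distinct elements,
    // regardless of the order in which they are listed.
    public static bool SameMembers(IEnumerable<int> first, IEnumerable<int> second)
    {
        var set = new HashSet<int>(first);
        return set.SetEquals(second);
    }

    public static void Main()
    {
        Console.WriteLine(SameMembers(new[] { 1, 2, 3 }, new[] { 2, 3, 1 })); // True
        Console.WriteLine(SameMembers(new[] { 1, 2, 3 }, new[] { 1, 2 }));    // False

        var set = new HashSet<int> { 1, 2, 3 };
        Console.WriteLine(set.Contains(2)); // True
        // set[45] would not compile: HashSet<T> defines no indexer.
    }
}
```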

It's somewhat dangerous to iterate over a HashSet<T> because doing so imposes an order on the items in the set. That order is not really a property of the set. You should not rely on it. If ordering of the items in a collection is important to you, that collection isn't a set.

Sets are really limited: all you get is a collection of unique members. On the other hand, they're really fast.

Solution 2

Here's a real example of where I use a HashSet<string>:

Part of my syntax highlighter for UnrealScript files is a new feature that highlights Doxygen-style comments. I need to be able to tell if a @ or \ command is valid to determine whether to show it in gray (valid) or red (invalid). I have a HashSet<string> of all the valid commands, so whenever I hit a @xxx token in the lexer, I use validCommands.Contains(tokenText) as my O(1) validity check. I really don't care about anything except existence of the command in the set of valid commands. Let's look at the alternatives I faced:

  • Dictionary<string, ?>: What type do I use for the value? The value is meaningless since I'm just going to use ContainsKey. Note: Before .NET 3.5 this was the only choice for O(1) lookups - HashSet<T> was added in 3.5 and extended to implement ISet<T> in 4.0.
  • List<string>: If I keep the list sorted, I can use BinarySearch, which is O(log n) (didn't see this fact mentioned above). However, since my list of valid commands is a fixed list that never changes, this will never be more appropriate than simply...
  • string[]: Again, Array.BinarySearch gives O(log n) performance. If the list is short, this could be the best performing option. It always has less space overhead than HashSet, Dictionary, or List. Even with BinarySearch, it's not faster for large sets, but for small sets it'd be worth experimenting. Mine has several hundred items though, so I passed on this.
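
As a rough sketch of the lookup described above, next to the sorted-array alternative: the command names below are hypothetical placeholders (the answer does not list the real ones), and CommandValidator, IsValid and IsValidBinarySearch are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CommandValidator
{
    // Hypothetical subset of commands; the real list is not given in the answer.
    private static readonly HashSet<string> ValidCommands =
        new HashSet<string>(StringComparer.Ordinal) { "brief", "param", "return", "see" };

    // Sorted copy of the same commands for the binary-search alternative.
    private static readonly string[] SortedCommands = BuildSorted();

    private static string[] BuildSorted()
    {
        var commands = new string[ValidCommands.Count];
        ValidCommands.CopyTo(commands);
        Array.Sort(commands, StringComparer.Ordinal);
        return commands;
    }

    // Hash lookup: close to O(1) per token.
    public static bool IsValid(string tokenText) => ValidCommands.Contains(tokenText);

    // Binary search over the sorted array: O(log n) per token.
    public static bool IsValidBinarySearch(string tokenText) =>
        Array.BinarySearch(SortedCommands, tokenText, StringComparer.Ordinal) >= 0;

    public static void Main()
    {
        Console.WriteLine(IsValid("param"));             // True
        Console.WriteLine(IsValid("bogus"));             // False
        Console.WriteLine(IsValidBinarySearch("param")); // True
        Console.WriteLine(IsValidBinarySearch("bogus")); // False
    }
}
```

Both methods give the same answers; which one is faster for a given list size is something to measure rather than assume.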

Solution 3

A HashSet<T> implements the ICollection<T> interface:

public interface ICollection<T> : IEnumerable<T>, IEnumerable
{
    // Methods
    void Add(T item);
    void Clear();
    bool Contains(T item);
    void CopyTo(T[] array, int arrayIndex);
    bool Remove(T item);

    // Properties
    int Count { get; }
    bool IsReadOnly { get; }
}

A List<T> implements IList<T>, which extends ICollection<T>:

public interface IList<T> : ICollection<T>
{
    // Methods
    int IndexOf(T item);
    void Insert(int index, T item);
    void RemoveAt(int index);

    // Properties
    T this[int index] { get; set; }
}

A HashSet has set semantics, implemented via a hashtable internally:

A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

What does the HashSet gain, if it loses index/position/list behavior?

Adding and retrieving items from the HashSet is always by the object itself, not via an indexer, and is close to an O(1) operation (List is amortized O(1) add, O(1) retrieve by index, O(n) find/remove).
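
A small sketch of that difference in practice. The helper AddAll and the class name are invented for this example; everything else is the standard HashSet<T> and List<T> API.

```csharp
using System;
using System.Collections.Generic;

public static class SetVersusListDemo
{
    // Adds each item and reports how many were actually new to the set.
    // HashSet<T>.Add returns false (and changes nothing) for an existing item.
    public static int AddAll(HashSet<string> set, IEnumerable<string> items)
    {
        int added = 0;
        foreach (var item in items)
        {
            if (set.Add(item)) added++;
        }
        return added;
    }

    public static void Main()
    {
        var fruits = new HashSet<string>();
        int added = AddAll(fruits, new[] { "apple", "pear", "apple" });
        Console.WriteLine(added);                   // 2: the second "apple" is rejected
        Console.WriteLine(fruits.Contains("pear")); // True, looked up by the object itself
        Console.WriteLine(fruits.Remove("pear"));   // True
        Console.WriteLine(fruits.Count);            // 1

        var list = new List<string> { "apple", "pear", "apple" };
        Console.WriteLine(list.Count);              // 3: duplicates are kept
        Console.WriteLine(list[1]);                 // pear, retrieved by position
        Console.WriteLine(list.IndexOf("plum"));    // -1 after scanning the whole list
    }
}
```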

A HashSet's behavior could be compared to using a Dictionary<TKey,TValue> in which you only add and remove keys and ignore the dictionary values themselves. You would expect the keys in a dictionary not to contain duplicates, and that's the point of the "Set" part.
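
To make the comparison concrete, here is a hedged sketch of both approaches side by side. ToKeySet and BuildSet are hypothetical helper names, and the bool value type is an arbitrary placeholder since the value is never read.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryAsSetDemo
{
    // Dictionary used as a set: only the keys matter, the value is a placeholder.
    public static Dictionary<string, bool> ToKeySet(IEnumerable<string> items)
    {
        var keys = new Dictionary<string, bool>();
        foreach (var item in items)
        {
            keys[item] = true; // the indexer overwrites, so duplicates collapse
        }
        return keys;
    }

    // The same membership expressed directly as a set.
    public static HashSet<string> BuildSet(IEnumerable<string> items) =>
        new HashSet<string>(items);

    public static void Main()
    {
        var input = new[] { "a", "b", "a" };
        var keySet = ToKeySet(input);
        var set = BuildSet(input);

        Console.WriteLine(keySet.Count);            // 2
        Console.WriteLine(set.Count);               // 2
        Console.WriteLine(keySet.ContainsKey("b")); // True
        Console.WriteLine(set.Contains("b"));       // True
    }
}
```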

Solution 4

Performance would be a bad reason to choose HashSet over List. Instead, what better captures your intent? If order is important, then Set (or HashSet) is out. If duplicates are permitted, likewise. But there are plenty of circumstances when we don't care about order, and we'd rather not have duplicates - and that's when you want a Set.

Solution 5

HashSet is a set implemented by hashing. A set is a collection of values containing no duplicate elements. The values in a set are also typically unordered. So no, a set cannot be used to replace a list (unless you should've used a set in the first place).

If you're wondering what a set might be good for: anywhere you want to get rid of duplicates, obviously. As a slightly contrived example, let's say you have a list of 10,000 revisions of a software project, and you want to find out how many people contributed to that project. You could use a HashSet<string>, iterate over the list of revisions, and add each revision's author to the set. Once you're done iterating, the size of the set is the answer you were looking for.
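
The contributor count described above might look like the sketch below. The Revision type and its properties are assumptions made up for the illustration, not part of any real version-control API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical revision record; only the author matters for this example.
public sealed class Revision
{
    public Revision(int number, string author)
    {
        Number = number;
        Author = author;
    }

    public int Number { get; }
    public string Author { get; }
}

public static class ContributorCounter
{
    // Adds every author to a set; repeated authors are ignored by Add,
    // so the final Count is the number of distinct contributors.
    public static int CountContributors(IEnumerable<Revision> revisions)
    {
        var authors = new HashSet<string>();
        foreach (var revision in revisions)
        {
            authors.Add(revision.Author);
        }
        return authors.Count;
    }

    public static void Main()
    {
        var revisions = new List<Revision>
        {
            new Revision(1, "alice"),
            new Revision(2, "bob"),
            new Revision(3, "alice"),
        };
        Console.WriteLine(CountContributors(revisions)); // 2
    }
}
```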

Author by

Joan Venge

Professional hitman.

Updated on March 10, 2020

Comments

  • Joan Venge about 4 years

    I am exploring the HashSet<T> type, but I don't understand where it stands in collections.

    Can one use it to replace a List<T>? I imagine the performance of a HashSet<T> to be better, but I couldn't see individual access to its elements.

    Is it only for enumeration?

  • Joan Venge almost 15 years
    But Set doesn't allow retrieval of single elements? Like set[45]?
  • Steve Guidi almost 15 years
    Another caveat: sets generally allow only one occurrence of an element.
  • earl almost 15 years
    For that, you'd iterate over the members of the set. Other typical operations are checking if the set contains an element or getting the size of the set.
  • sepp2k almost 15 years
    Adding an existing item to a set will not throw an exception. Add will simply return false. Also: technically hash lookup is O(n), not O(1), unless you have a perfect hashing function. Of course in practice you'll get away with assuming it's O(1) unless the hashing function is really bad.
  • Noldorin almost 15 years
    @sepp2k: Yeah, so it returns a boolean... The point is, it notifies you. And hash lookup is worst case O(n) if your bucketing is terrible - it's much closer to O(1) in general.
  • SamuelWarren almost 14 years
    If you only iterate over them then the HashSet method adds quite a bit of memory usage compared to the List.
  • Hardwareguy over 13 years
    Unless you care about the key, then you should use the dictionary.
  • Oscar Mederos about 13 years
    Performance would be a bad reason to choose HashSet over List: I just don't agree with you. That's kind of saying that choosing a Dictionary instead of two Lists doesn't help in performance. Take a look at the following article
  • Carl Manaster about 13 years
    @Oscar: I didn't say that sets aren't faster - I said that would be a bad basis for choosing them. If you are trying to represent an ordered collection, a set simply won't work and it would be a mistake to try to shoehorn it in; if the collection you want has no order, a set is perfect - and fast. But what's important is the first question: what are you trying to represent?
  • Casey almost 9 years
    But think about it. If you want to keep checking whether given strings are members of some collection of 10,000 strings, technically, string[].Contains and HashSet<string>.Contains express your intent equally well; the reason to pick the HashSet is it will run much faster.
  • Veverke almost 8 years
    The fact that the framework provides a SortedSet data structure either contradicts what you say about order not being a property of a set - or points to a misunderstanding on the part of the development team.
  • Kit over 7 years
    I think it's more correct to say that the order of the items in the HashSet is not defined, so don't rely on the iterator's order. If you iterate the set because you are doing something against the items in the set, that is not dangerous unless you are relying on anything related to order. A SortedSet has all the properties of the HashSet plus order; however, SortedSet does not derive from HashSet. Rephrased, a SortedSet is an ordered collection of distinct objects.