SortedSet<T> vs HashSet<T>

c# .net collections generics

54,942

Solution 1

If you don't need sorting, you shouldn't use a class that does sorting because that means your application will be doing more work than it needs to. (It will make your app faster, in other words).

Solution 2

This is about choosing the right tool for the job. Depends on the way you are going to use your collection.

This page has a nice table detailing the differences between various collection classes.

Below is an extract from that table concerning the collections you are asking about:

Collection  Ordering    Contiguous Storage? Direct Access?  Lookup Efficiency   Manipulate Efficiency
SortedSet   Sorted          No              Via Key             Key:O(log n)            O(log n)            
HashSet     Unordered       Yes             Via Key             Key:O(1)                O(1)

Solution 3

Both HashSet<T> and SortedSet<T> are implementing interface ISet<T> which is a data structure holding unique elements.

The main difference between them is the underlying data structure they use to store data. HashSet<T> uses a hash-table while SortedSet<T> uses a red-black tree which is a balanced binary tree.

The HashSet<T> which uses a hash table does the basic operations (i.e. Add, Remove, Search) faster than SortedSet<T> as the complexity of HashSet<T> is O(1) meaning it will do basic operations independent of the size of input data in a constant period of time while the complexity of SortedSet<T> is log(N) meaning depend on the size of input it will do the basic operations logarithmic. for example if the size of your input data is 1,000 then then the program does the basic operations in 10 steps and if it is 1,000,000 the program does the basic operations in 20 steps.

Conclusion: Use HashSet<T> if you don't need the elements to be sorted otherwise use SortedSet<T>. It means Using HashSet<T> is preferable unless you need sorting.

54,942

Author by

Batrickparry

Updated on July 09, 2022

Comments

Batrickparry almost 2 years

My question is that what is the need of HashSet<T> when we have SortedSet<T>! All HashSet's methods are available in SortedSet too, moreover SortedSet is advantageous as it provides collection already in sorted manner! Even then HashSet is present. For what is it useful then?
Christian Mann over 13 years

More importantly, the algorithm will run faster. Hashing is O(1), whereas the sorted set is likely using a binary search tree, which is O(log n) in the average case -- far worse performance.
Batrickparry over 13 years

In that case we can use List<T>, isn't it! Why need HashSet<T>??
OnesimusUnbound over 13 years

Set are for unique items, List may contain duplicate entries. msdn.microsoft.com/en-us/library/bb359438.aspx for the HashSet<T> documentation. It says: A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
Batrickparry over 13 years

Please tell me one more thing... what does O(1) mean? I know it's a silly question, but seriously I have no idea what it is!
Jacob over 13 years

It's a rough indicator of the computational intensity of an algorithm. See en.wikipedia.org/wiki/Big_O_notation
Jeff Mercado over 13 years

@Novice: In layman's terms, an algorithm running in O(1) means that it runs in the same amount of time regardless the size of the input. Otherwise, the time is dependent on the size of the input n and is expressed as a function of n. e.g., linear: O(n), quadratic: O(n^2), etc. The big-O wiki page can be a tough read, this one summarizes it well I think.
Bobby Adams about 12 years

FYI, the documentation for SortedSet appears to be inconsistent. My local help says that SortedSet(Of T).Contains runs in O(1) time, while MSDN online says it runs in O(ln n) time. HashSet, however, runs in O(1) time in both versions of the documentation.
Lucero almost 12 years

@BlueMonkMN, the online version (MSDN) was obviously fixed compared to your older, wrong version. A SortedSet<> performs lookups in O(log n) time, a HashSet<> in O(1) time, and a List<> in O(n) time.
jwezorek over 6 years

This is correct if your primary concern is speed but if your primary concern is memory, like say you are going to have a collection of many sets, a sortedset will all things being equal tend to use less space than a hash set because a sorted set is (presumably) a binary search tree while a hash set is a hash table that needs to be initiazed with some capacity with probably a heuristically determined amount of room to grow.
Dai about 5 years

I saw the article you linked to lists the Hashtable-based types (Dictionary<TKey,TValue> and HashSet<T>) as "contiguous" - but they don't store their keys contiguously in their internal (i.e. often sparse) key arrays, whereas SortedDictionary and SortedSet use a dense tree, so I'm surprised the article doesn't touch on the memory usage of SortedSet vs HashSet for differing numbers of keys.
Servy over 3 years

@Svisstack It is not appropriate to edit answers on this question as you have been. If you feel the answer is incorrect, feel free to post you own answer that provides the facts you feel are accurate (or a comment explaining why you feel the author ought to change it). Additionally, this answer is quoting another source, changing the content is now claiming that another source is saying something that it is not, which is super bad. Your edit is also just factually wrong, but that's simply moot; it's not an appropriate edit anyway.
Servy over 3 years

@jwezorek A tree based implementation requires an additional object and reference for every single item in the collection. If the collection is of reference types (a common situation) then the collection uses more than twice as much memory as an array would. Hashsets are going to be about that bad in the worse case, and won't be nearly as bad in the average case.
Svisstack over 3 years

@Servy: it's not factually wrong, if you think that you can lookup on the HashSet in the O(1) then you are crazy, please read more about the en.wikipedia.org/wiki/Big_O_notation
Servy over 3 years

@Svisstack Technically lookup in a hashset is O(m) where m is the average hash collision rate of your hash function. For a perfectly evenly distributed hash function, that results in the lookup being O(1), for a perfectly bad hash function that always collides, that'd make the lookup O(n) where n is the size of the set. You generally only use a hash set with types that can have a good hash function, making it O(1) in most practical situations. What makes you think it's O(log(n))?
Zar Shardan over 3 years

@Svisstack en.wikipedia.org/wiki/Hash_table
Svisstack over 3 years

@Servy: You can't assume that your hash function is good and you have a limit set of data and preventing collisions in the big-o calculation. If you have a collision then you basically have Set and this is why you have log(n). If you are claiming constant on the lookup then this is the BEST CASE not the Big-O as in the Big-O you are assuming that the data approching infinity (it's the mathematical limit). Hashing completely doesn't matter in limit as the hash function and table size have constant size en.wikipedia.org/wiki/Big_O_notation
Servy over 3 years

@Svisstack "You can't assume that your hash function is good" Well, you can. And most everyone does. If you can't properly hash the object you shouldn't be using it in a hash based collection. Some people will put an asterisk on that notation to indicate it assumes a good hash, because you're right it is an assumption made when stating O(1), even when it's a valid assumption. "If you have a collision then you basically have Set " No, then you have a list. Searching through it requires a linear search, which is O(m) where m is the number of items in that hash bucket.
Servy over 3 years

@Svisstack Very few data structures are designed in such a way that they will work properly, or even at all, when the number of items exceeds int.MaxValue. Saying that if you have much, much more than int.MaxValue items operations are no longer O(1) is pretty meaningless as it's just not a supported use case from the start, so it's not a useful point to consider. Things like HashSet, List, and Dictionary and so on will literally just break because arrays aren't allowed to be that big and they're all backed by a large array.
Servy over 3 years

@Svisstack "If we would like in the world like that, put into hashset one item and you will have the same the contains time as if you will put max random items into the same hashset." Yes. That is absolutely true assuming the item has a well defined hashing function. That's why hash sets are so useful. Checking if an item is in the set requires about the same amount of time regardless of the number of items in the set. It has restrictions on when you can do that (the item needs to be efficiently hashable), but that's a condition that's often achievable.
RainCast almost 3 years

This is the best answer under this question. Very informational, thanks for sharing.