how does except method work in linq

12,560

Solution 1

Your guess was close - the Linq to Objects Except extension method uses a HashSet<T> internally for the second sequence passed in - that allows it to look up elements in O(1) while iterating over the first sequence to filter out elements that are contained in the second sequence, hence the overall effort is O(n+m) where n and m are the length of the input sequences - this is the best you can hope to do since you have to look at each element at least once.

For a review of how this might be implemented I recommend Jon Skeet's EduLinq series, here part of it's implementation of Except and the link to the full chapter:

private static IEnumerable<TSource> ExceptImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> bannedElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (bannedElements.Add(item))
        {
            yield return item;
        }
    }
}

Your first implementation on the other hand will compare each element in the first list to each element in the second list - it is performing a cross product. This will require nm operations so it will run in O(nm) - when n and m become large this becomes prohibitively slow very fast. (Also this solution is wrong as is since it will create duplicate elements).

Solution 2

The two code examples do not produce the same results.

Your old code creates the Cartesian Product of the two lists.

That means it will return each element a in list1 multiple times - once for each element b in list2 that is not equal to a.

With "large" lists, this will take a long time.

Solution 3

from a in list1 from b in list2 creates list1.Count * list2.Count elements and is not the same as list1.Except(list2)!

If list1 has the elements { a, b, c, d } and list2 the elements { a, b, c }, then your first query will yield the following pairs:

(a,a), (a,b), (a,c),  
(b,a), (b,b), (b,c),  
(c,a), (c,b), (c,c),  
(d,a), (d,b), (d,c)

because you exclude equal items the result will be

(a,a), (a,b), (a,c),  
(b,a), (b,b), (b,c),  
(c,a), (c,b), (c,c),  
(d,a), (d,b), (d,c)

And because you select only the first element of the pairs, you will get

{ a, a, b, b, c, c, d, d, d }


The second query will yield { a, b, c, d } minus { a, b, c }, i.e { d }.


If no hash table is was used in Exclude then a nested loop performing with O(m*n) would result. With a hash table the query approximately performs with O(n) (neglecting the cost for filling the hash table).

Share:
12,560

Related videos on Youtube

Tono Nam
Author by

Tono Nam

Updated on June 04, 2022

Comments

  • Tono Nam
    Tono Nam almost 2 years

    I have the classes:

    class SomeClass
    {
       public string Name{get;set;}
       public int SomeInt{get;set;}
    }
    
    
    class SomeComparison: IEqualityComparer<SomeClass>
    {
         public bool Equals(SomeClass s, SomeClass d)
         {
             return s.Name == d.Name;
         }
    
         public int GetHashCode(SomeClass a)
         {
             return (a.Name.GetHashCode() * 251);
         }
    }
    

    I also have two large List<SomeClass> called list1 and list2

    before I used to have:

     var q = (from a in list1
             from b in list2
             where a.Name != b.Name
             select a).ToList();
    

    and that took about 1 minute to execute. Now I have:

    var q =  list1.Except(list2,new SomeComparison()).ToList();
    

    and that takes less than 1 second!

    I will like to understand what does the Except method do. Does the method creates a hash table of each list and then perform the same comparison? If I will be performing a lot of this comparisons should I create a Hashtable instead?


    EDIT

    Now instead of having lists I have two HashSet<SomeClass> called hashSet1 and hashSet2

    when I do:

       var q = (from a in hashSet1
               form b in hashSet2
               where a.Name != b.Name
               select a).ToList();
    

    that still takes a long time... What am I doing wrong?

    • Daniel Mann
      Daniel Mann about 12 years
      Did you look at the method with a reflector like ILSpy?
    • maria nabil
      maria nabil about 12 years
      You're still doing a cross product. do from a in hashSet1 where !hashSet2.Contains(a.Name) select a
  • svick
    svick about 12 years
    Okay, but why is that so much faster than the first option?
  • Tono Nam
    Tono Nam about 12 years
    I think cause it does not have to iterate through the whole list2 again in order to make the comparison. When adding if there is a collision it does not add the results...
  • Tono Nam
    Tono Nam about 12 years
    BrokenGlass if I will be performing a lot of this comparisons should I create a HashSet instead?
  • BrokenGlass
    BrokenGlass about 12 years
    @TonoNam: Yes - use a HashSet, also I don't think your current approach - despite being slow - will produce the desired results - it will produce duplicates. See updated answer.
  • Tono Nam
    Tono Nam about 12 years
    BrokenGlass I am working on an update... thanks a lot for the help.
  • Jeff Mercado
    Jeff Mercado about 12 years
    If you want to be nitpicky, internally they use a custom implementation of a hash set. They don't use the one included in the framework. I suppose to keep it lightweight with a less bloated version.
  • Plutoz
    Plutoz about 8 years
    Distinct at the end is confusing. The question contains Lists and this allow multiple occurrences.
  • Rezo Megrelidze
    Rezo Megrelidze about 8 years
    @Plutoz Except is a set operations and sets always have distinct elements.