Why is .Contains slow? Most efficient way to get multiple entities by primary key?

Solution 1

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The analysis in this answer is great but largely obsolete since 2013.

Using Contains in Entity Framework is actually very slow. It's true that it translates into an IN clause in SQL and that the SQL query itself executes fast. But the problem, and the performance bottleneck, is the translation from your LINQ query into SQL. Because there is no native expression that represents an IN, the expression tree is expanded into a long chain of OR concatenations. When the SQL is created, this chain of ORs is recognized and collapsed back into a SQL IN clause.
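Conceptually, what EF (before EF6's InExpression) builds for a small ids collection is equivalent to writing this expression by hand; MyEntity and ID are placeholder names, and the point is that the tree grows linearly with the collection size:

```csharp
using System;
using System.Linq.Expressions;

// Illustration only: for ids = { 1, 2, 3 }, the Contains call is
// expanded into one equality comparison per element, OR'ed together.
Expression<Func<MyEntity, bool>> expanded =
    e => e.ID == 1 || e.ID == 2 || e.ID == 3;
```

With 12,000 elements, EF has to build and later collapse a tree of 12,000 such nodes, which is where the time goes.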

This does not mean that using Contains is worse than issuing one query per element in your ids collection (your first option). It's probably still better - at least for collections that aren't too large. But for large collections it is really bad. Some time ago I tested a Contains query with about 12,000 elements; it worked, but took around a minute, even though the query in SQL executed in less than a second.

It might be worth testing the performance of a combination of multiple roundtrips to the database, with a smaller number of elements in the Contains expression for each roundtrip.
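A minimal sketch of that batching idea (untested; the chunk size of 1000 and the entity/property names are placeholders, not measured values):

```csharp
using System.Collections.Generic;
using System.Linq;

// Query in batches so each Contains expression stays small.
const int chunkSize = 1000;
var result = new List<MyEntity>();

for (int i = 0; i < ids.Count; i += chunkSize)
{
    var chunk = ids.Skip(i).Take(chunkSize).ToList();
    result.AddRange(context.Set<MyEntity>()
        .Where(e => chunk.Contains(e.ID))
        .ToList());
}
```

The trade-off is more roundtrips versus cheaper LINQ-to-SQL translation per query; the sweet spot depends on your data and network latency.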

This approach, and the limitations of using Contains with Entity Framework, are shown and explained here:

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

It's possible that a raw SQL command will perform best in this situation, which would mean calling dbContext.Database.SqlQuery<Image>(sqlString) or dbContext.Images.SqlQuery(sqlString), where sqlString is the SQL shown in @Rune's answer.

Edit

Here are some measurements:

I ran this against a table with 550,000 records and 11 columns (IDs start at 1 without gaps) and picked 20,000 ids at random:

using (var context = new MyDbContext())
{
    Random rand = new Random();
    var ids = new List<int>();
    for (int i = 0; i < 20000; i++)
        ids.Add(rand.Next(550000));

    Stopwatch watch = new Stopwatch();
    watch.Start();

    // here are the code snippets from below

    watch.Stop();
    var msec = watch.ElapsedMilliseconds;
}

Test 1

var result = context.Set<MyEntity>()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> msec = 85.5 sec

Test 2

var result = context.Set<MyEntity>().AsNoTracking()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> msec = 84.5 sec

This tiny effect of AsNoTracking is very unusual. It indicates that the bottleneck is not object materialization (and not SQL as shown below).

For both tests it can be seen in SQL Profiler that the SQL query arrives at the database very late. (I didn't measure exactly but it was later than 70 seconds.) Obviously the translation of this LINQ query into SQL is very expensive.

Test 3

var values = new StringBuilder();
values.AppendFormat("{0}", ids[0]);
for (int i = 1; i < ids.Count; i++)
    values.AppendFormat(", {0}", ids[i]);

var sql = string.Format(
    "SELECT * FROM [MyDb].[dbo].[MyEntities] WHERE [ID] IN ({0})",
    values);

var result = context.Set<MyEntity>().SqlQuery(sql).ToList();

Result -> msec = 5.1 sec

Test 4

// same as Test 3 but this time including AsNoTracking
var result = context.Set<MyEntity>().SqlQuery(sql).AsNoTracking().ToList();

Result -> msec = 3.8 sec

This time the effect of disabling tracking is more noticeable.

Test 5

// same as Test 3 but this time using Database.SqlQuery
var result = context.Database.SqlQuery<MyEntity>(sql).ToList();

Result -> msec = 3.7 sec

My understanding is that context.Database.SqlQuery<MyEntity>(sql) is the same as context.Set<MyEntity>().SqlQuery(sql).AsNoTracking(), so there is no difference expected between Test 4 and Test 5.

(The length of the result sets was not always the same due to possible duplicates after the random id selection but it was always between 19600 and 19640 elements.)

Edit 2

Test 6

Even 20000 roundtrips to the database are faster than using Contains:

var result = new List<MyEntity>();
foreach (var id in ids)
    result.Add(context.Set<MyEntity>().SingleOrDefault(e => e.ID == id));

Result -> msec = 73.6 sec

Note that I have used SingleOrDefault instead of Find. Using the same code with Find is very slow (I cancelled the test after several minutes) because Find calls DetectChanges internally. Disabling auto change detection (context.Configuration.AutoDetectChangesEnabled = false) leads to roughly the same performance as SingleOrDefault. Using AsNoTracking reduces the time by one or two seconds.

Tests were done with database client (console app) and database server on the same machine. The last result might get significantly worse with a "remote" database due to the many roundtrips.

Solution 2

The second option is definitely better than the first. The first option results in one query to the database per element of ids, while the second option can use an 'IN' operator in the SQL query. It basically turns your LINQ query into something like the following SQL:

SELECT *
FROM ImagesTable
WHERE id IN (value1,value2,...)

where value1, value2 etc. are the values of your ids variable. Be aware, however, that I think there may be an upper limit on the number of values that can be serialized into a query in this way. I'll see if I can find some documentation...

Solution 3

I am using Entity Framework 6.1, and with your code I found that it is better to use:

return db.PERSON.Find(id);

rather than:

return db.PERSON.FirstOrDefault(x => x.ID == id);

See Performance of Find() vs. FirstOrDefault for some thoughts on this.

Solution 4

Well, recently I had a similar problem, and the best way I found was to insert the list into a temp table and then join against it.

private List<Foo> GetFoos(IEnumerable<long> ids)
{
    var sb = new StringBuilder();
    sb.Append("DECLARE @Temp TABLE (Id bigint PRIMARY KEY)\n");

    // The ids are numeric, so building the SQL by concatenation
    // is safe from injection here.
    foreach (var id in ids)
    {
        sb.Append("INSERT INTO @Temp VALUES (");
        sb.Append(id);
        sb.Append(")\n");
    }

    sb.Append("SELECT f.* FROM [dbo].[Foo] f INNER JOIN @Temp t ON f.Id = t.Id");

    return this.context.Database.SqlQuery<Foo>(sb.ToString()).ToList();
}

It's not a pretty way, but for large lists it performs very well.
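An alternative that avoids string-building entirely is a table-valued parameter, mentioned in the comments below. This is a sketch, not a tested implementation: it assumes SQL Server, EF6's Database.SqlQuery, and a table type created beforehand with CREATE TYPE dbo.IdList AS TABLE (Id bigint PRIMARY KEY); the type, entity, and column names are placeholders:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

private List<Foo> GetFoosViaTvp(IEnumerable<long> ids)
{
    // Fill a DataTable whose shape matches the dbo.IdList table type.
    var table = new DataTable();
    table.Columns.Add("Id", typeof(long));
    foreach (var id in ids)
        table.Rows.Add(id);

    // Pass the whole list as a single structured parameter.
    var parameter = new SqlParameter("@Ids", SqlDbType.Structured)
    {
        TypeName = "dbo.IdList",
        Value = table
    };

    return this.context.Database
        .SqlQuery<Foo>(
            "SELECT f.* FROM [dbo].[Foo] f INNER JOIN @Ids t ON f.Id = t.Id",
            parameter)
        .ToList();
}
```

Compared with the temp-table approach, this sends the ids in one parameter instead of one INSERT statement per id, and it sidesteps the concurrency concern raised in the comments, since each call gets its own parameter value.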

Tom

Updated on January 19, 2022

Comments

  • Tom
    Tom over 2 years

    What's the most efficient way to select multiple entities by primary key?

    public IEnumerable<Models.Image> GetImagesById(IEnumerable<int> ids)
    {
    
        //return ids.Select(id => Images.Find(id));       //is this cool?
        return Images.Where( im => ids.Contains(im.Id));  //is this better, worse or the same?
        //is there a (better) third way?
    
    }
    

    I realise that I could do some performance tests to compare, but I am wondering if there is in fact a better way than both, and am looking for some enlightenment on what the difference between these two queries is, if any, once they have been 'translated'.

  • Tom
    Tom over 12 years
    Thanks, is this the way to go then do you think? Or is there an alternative approach?
  • Reed Copsey
    Reed Copsey over 12 years
    @Tim: This is definitely the way to go. As long as you're on EF 4+, you can use this, and get a single fetch to the db...
  • Rune
    Rune over 12 years
    I think this is the way to go - I am not aware of a better approach. I'll see if I can find that size limit documented somewhere. If it exists, you can just divide ids into chunks of an appropriate size. If your collection of ids is known to be small, you may even be able to just ignore the issue. I believe the limit is something like 512 or 1024...
  • Rune
    Rune over 12 years
    Hmm, can't find that upper limit documented anywhere. Maybe I am mistaken and it only applied to Linq2SQL. I'll upvote anybody who documents (possibly, the lack of) that upper limit :-)
  • Rune
    Rune over 12 years
    I have tested this with > 4096 entries in the collection. It works, though serializing the entries takes a non-trivial amount of time.
  • Tom
    Tom over 12 years
    Like how much time and what exactly was > 4096? Is there a better general approach to this class of problem I wonder?
  • Allon Guralnek
    Allon Guralnek over 12 years
    There's no limit in EF since it converts it to a non-parameterized IN expression (column IN (val1, val2, val3, ...)) whereas LINQ-to-SQL parameterizes all the IN values (column IN (@p1, @p2, @p3, ...)) and so you hit the 2100 parameter limit pretty quickly.
  • Allon Guralnek
    Allon Guralnek over 12 years
    @Tom: Yes, there's a better general approach - table valued parameters. But that requires writing stored procedures and is not supported in LINQ-to-SQL or EF, plus it has its own maladies (e.g. caching an unsuitable query plan).
  • Tom
    Tom over 12 years
    Would I be correct in concluding that this kind of query has a (little) code smell about it?
  • Tom
    Tom over 12 years
    I am starting to think that ideally, from a performance point of view, this type of query is something that should simply be avoided.
  • Slauma
    Slauma over 12 years
    @Tom: I did some tests, see my Edit.
  • Tom
    Tom over 12 years
    Great answer. Probably the only thing I would add to that is.. If this sort of thing can be avoided by designing around it, then it probably should be. Would you agree?
  • Slauma
    Slauma over 12 years
    @Tom: I've done a test number 6, see my Edit 2. Yes, avoiding this situation in the first place seems a good strategy. But sometimes you might just need such a query and can't circumvent it. My conclusion is more: Sometimes using raw SQL makes sense or is even a MUST for performance reasons with an ORM like Entity Framework.
  • Slauma
    Slauma over 12 years
    @Tom: +1 for your question now, I forgot it. Because I learned a lot and it helped me to detect a serious performance bottleneck in an own application. Thanks for the question :)
  • ThisGuy
    ThisGuy over 10 years
    Great answer and research. A super minor note, but you can build the list of IDs in Test 3 in one line like this: string values = string.Join(", ", ids);
  • Tom
    Tom about 9 years
    also see stackoverflow.com/questions/11686225/… especially the comments at the bottom
  • Juanito
    Juanito about 9 years
excellent information Tom, it was exactly what I was looking for
  • spender
    spender over 8 years
    This answer is superb. Great effort.
  • stack
    stack almost 7 years
    Running a loop was faster for me
  • Gert Arnold
    Gert Arnold almost 6 years
    Sorry, but this doesn't make sense. Images.toArray() pulls the whole Images table into memory. You can't seriously mean that. Also, it's not clear what the first line is doing there.
  • Christopher Townsend
    Christopher Townsend about 5 years
This solution is very well described and helped a lot. The only thing to be aware of is that using SqlQuery in this way won't work for entities that use inheritance.
  • Sebastian
    Sebastian over 2 years
How will this behave if multiple users run the same code block with different selection sets (potentially from a web browser)?
  • nelson eldoro
    nelson eldoro over 2 years
@Sebastian Actually it can be a mess, but in your case you could concatenate the session id (or another identifier) into the table name, e.g. string tempTableName = "@Temp_" + HttpContext.Session.SessionID;