Split a collection into `n` parts with LINQ?


Solution 1

A pure LINQ solution, and the simplest one, is shown below.

static class LinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
    {
        int i = 0;
        // Round-robin: the element at position i goes to part (i % parts).
        var splits = from item in list
                     group item by i++ % parts into part
                     select part.AsEnumerable();
        return splits;
    }
}
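
A quick usage sketch of my own, assuming the extension class above is in scope. Note that the elements are dealt out round-robin, and that the captured counter i keeps incrementing if the result is enumerated a second time, so materialize the result if you need to read it more than once:

var parts = Enumerable.Range(1, 10).Split(3)
                      .Select(p => p.ToList())
                      .ToList();
// parts[0] = { 1, 4, 7, 10 }
// parts[1] = { 2, 5, 8 }
// parts[2] = { 3, 6, 9 }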

Solution 2

EDIT: Okay, it looks like I misread the question. I read it as "pieces of length n" rather than "n pieces". Doh! Considering deleting answer...

(Original answer)

I don't believe there's a built-in way of partitioning, although I intend to write one in my set of additions to LINQ to Objects. Marc Gravell has an implementation here, although I would probably modify it to return a read-only view:

// Requires: using System.Collections.Generic; using System.Collections.ObjectModel;
public static IEnumerable<IEnumerable<T>> Partition<T>
    (this IEnumerable<T> source, int size)
{
    T[] array = null;
    int count = 0;
    foreach (T item in source)
    {
        if (array == null)
        {
            array = new T[size];
        }
        array[count] = item;
        count++;
        if (count == size)
        {
            yield return new ReadOnlyCollection<T>(array);
            array = null;
            count = 0;
        }
    }
    if (array != null)
    {
        Array.Resize(ref array, count);
        yield return new ReadOnlyCollection<T>(array);
    }
}
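
For completeness, a usage sketch of my own:

var pieces = Enumerable.Range(1, 11).Partition(4);
// yields { 1, 2, 3, 4 }, { 5, 6, 7, 8 }, { 9, 10, 11 }
// each piece is a ReadOnlyCollection<int>; the last one is ragged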

Solution 3

static class LinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
    {
        return list.Select((item, index) => new { index, item })
                   .GroupBy(x => x.index % parts)
                   .Select(x => x.Select(y => y.item));
    }
}
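
This behaves like Solution 1, but the counter comes from the Select overload that supplies the element index, so it resets on each enumeration instead of living in a captured variable. A small illustration of mine:

var parts = new[] { "a", "b", "c", "d", "e" }.Split(2);
// yields { "a", "c", "e" } and { "b", "d" }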

Solution 4

Ok, I'll throw my hat in the ring. The advantages of my algorithm:

  1. No expensive multiplication, division, or modulus operators
  2. All operations are O(1) (see note below)
  3. Works for IEnumerable<> source (no Count property needed)
  4. Simple

The code:

public static IEnumerable<IEnumerable<T>>
  Section<T>(this IEnumerable<T> source, int length)
{
  if (length <= 0)
    throw new ArgumentOutOfRangeException("length");

  var section = new List<T>(length);

  foreach (var item in source)
  {
    section.Add(item);

    if (section.Count == length)
    {
      yield return section.AsReadOnly();
      section = new List<T>(length);
    }
  }

  if (section.Count > 0)
    yield return section.AsReadOnly();
}
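
A usage sketch of my own:

var sections = Enumerable.Range(1, 10).Section(4);
// yields { 1, 2, 3, 4 }, { 5, 6, 7, 8 }, { 9, 10 } -- the last section is ragged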

As pointed out in the comments below, this approach doesn't actually address the original question, which asked for a fixed number of sections of approximately equal length. That said, you can still use my approach to solve the original question by calling it this way:

myEnum.Section(myEnum.Count() / number_of_sections + 1)

When used in this manner, the approach is no longer O(1) as the Count() operation is O(N).
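
Also note that when the count is an exact multiple of the desired number of sections, Count() / number_of_sections + 1 yields fewer sections than requested; ceiling division avoids that. A hypothetical wrapper (the name SplitIntoParts and its shape are my own sketch, building on the Section method above):

public static IEnumerable<IEnumerable<T>> SplitIntoParts<T>(
    this IEnumerable<T> source, int parts)
{
    if (parts <= 0)
        throw new ArgumentOutOfRangeException("parts");

    // Materialize once so that counting does not enumerate the source twice.
    var list = source as ICollection<T> ?? source.ToList();

    // Ceiling of Count / parts; Math.Max keeps Section happy for an empty source.
    int length = Math.Max(1, (list.Count + parts - 1) / parts);
    return list.Section(length);
}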

Solution 5

This is the same as the accepted answer, but a much simpler representation:

public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> items, 
                                                   int numOfParts)
{
    int i = 0;
    // Note: i is captured, so enumerating the result a second time shifts the groups.
    return items.GroupBy(x => i++ % numOfParts);
}

The above method splits an IEnumerable<T> into N chunks of equal or near-equal size.
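
For example (my own illustration), the elements alternate between the chunks rather than staying contiguous:

var dept = new List<int> { 1, 2, 3, 4, 5 };
var split = dept.Split(2);
// yields { 1, 3, 5 } and { 2, 4 } -- round-robin, not contiguous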

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, 
                                                       int partitionSize)
{
    int i = 0;
    return items.GroupBy(x => i++ / partitionSize).ToArray();
}

The above method splits an IEnumerable<T> into chunks of a desired fixed size, with the total number of chunks left to fall where it may - which is not what the question asks.
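
A usage sketch of my own (.NET 6 later shipped essentially this behavior as Enumerable.Chunk, which the comments below touch on):

var chunks = new[] { 1, 2, 3, 4, 5, 6, 7 }.Partition(3);
// yields { 1, 2, 3 }, { 4, 5, 6 }, { 7 } -- contiguous and in order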

The problem with the Split method, besides being slower, is that it scrambles the output: grouping by position % N deals the elements out round-robin, so you don't get the chunks in the original order.

Almost every answer here either doesn't preserve order, or is about partitioning rather than splitting, or is plainly wrong. Try this instead; it's faster and preserves order, but is a little more verbose:

public static IEnumerable<IEnumerable<T>> Split<T>(this ICollection<T> items, 
                                                   int numberOfChunks)
{
    if (numberOfChunks <= 0 || numberOfChunks > items.Count)
        throw new ArgumentOutOfRangeException("numberOfChunks");

    int sizePerPacket = items.Count / numberOfChunks;
    int extra = items.Count % numberOfChunks;

    for (int i = 0; i < numberOfChunks - extra; i++)
        yield return items.Skip(i * sizePerPacket).Take(sizePerPacket);

    int alreadyReturnedCount = (numberOfChunks - extra) * sizePerPacket;
    int toReturnCount = sizePerPacket + 1; // each remaining chunk carries one extra item
    for (int i = 0; i < extra; i++)
        yield return items.Skip(alreadyReturnedCount + i * toReturnCount).Take(toReturnCount);
}
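
A usage sketch of my own:

var chunks = Enumerable.Range(1, 11).ToList().Split(3);
// yields { 1, 2, 3 }, { 4, 5, 6, 7 }, { 8, 9, 10, 11 }
// order is preserved and the sizes differ by at most one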

The equivalent method for a Partition operation is here.


Comments

  • aruno
    aruno over 2 years

    Is there a nice way to split a collection into n parts with LINQ? Not necessarily evenly of course.

    That is, I want to divide the collection into sub-collections, each containing a subset of the elements, where the last collection can be ragged.

  • Marc Gravell
    Marc Gravell over 15 years
    You really don't like those "array[count++]", eh ;-p
  • Jon Skeet
    Jon Skeet over 15 years
    Ironically I might well have used it before yesterday. But having written that answer it would have been hypocritical - and looking at both versions I think this is slightly easier to read at a glance. At first I used a list instead - Add is even more readable :)
  • Marc Gravell
    Marc Gravell over 15 years
    Well, I just added a slightly different answer (to address "n pieces" rather than "pieces of length n"), and mixed in parts of your version ;-p
  • Jonathan Allen
    Jonathan Allen about 15 years
    Doing all those modulus operations can get a bit expensive on long lists.
  • Marc Gravell
    Marc Gravell over 14 years
    It would be better to use the Select overload that includes the index.
  • reustmd
    reustmd about 13 years
    I've added a response that uses the select overload and method chaining syntax
  • ShadowChaser
    ShadowChaser about 13 years
    This version breaks the contract of IEnumerator. It's not valid to throw InvalidOperationException when Reset is called - I believe many of the LINQ extension methods rely on this behavior.
  • ShadowChaser
    ShadowChaser about 13 years
    I initially programmed the same thing, but the pattern breaks when Reset is called on one of the nested IEnumerable<T> instances.
  • ShadowChaser
    ShadowChaser about 13 years
    Brilliant - best solution here! A few optimizations: * Clear the linked list instead of creating a new one for each section. A reference to the linked list is never returned to the caller, so it's completely safe. * Don't create the linked list until you reach the first item - that way there's no allocation if the source is empty
  • Mike
    Mike about 13 years
    @ShadowChaser According to MSDN clearing the LinkedList is O(N) complexity so it would ruin my goal of O(1). Of course, you could argue that the foreach is O(N) to begin with... :)
  • aruno
    aruno over 12 years
    of course this time around I needed pieces of 8 rather than 8 pieces :-) and this worked great
  • Brad
    Brad over 12 years
    Does this still work if you only enumerate the partition and not the inner enumerable? Since the inner enumerator is deferred, none of the code for the inner (take from current) will execute until it is enumerated, so MoveNext() will only be called by the outer partition function, right? If my assumptions are true, this can potentially yield n partitions with n elements in the original enumerable, and the inner enumerables will yield unexpected results.
  • Robin Maben
    Robin Maben over 12 years
    @HasanKhan: What if I don't know how many parts I'm splitting it into? I mean, split into groups of 3, let's say.
  • Elmer
    Elmer over 12 years
    Please explain why. I have been using this function without any trouble!
  • Muhammad Hasan Khan
    Muhammad Hasan Khan over 12 years
    read the question again and see if you get n (almost) equal length parts with your function
  • Gishu
    Gishu over 11 years
    Thank you for not deleting even though it is not an answer for the OP, I wanted the exact same thing - pieces of length n :).
  • nawfal
    nawfal over 11 years
    your answer is right, but the question is wrong for it. Your answer gives an unknown number of chunks with a fixed size for each chunk. But the OP wants a Split functionality that gives a fixed number of chunks with any size per chunk (hopefully of equal or close-to-equal sizes). Perhaps more suited here: stackoverflow.com/questions/3773403/…
  • nawfal
    nawfal over 11 years
    @Elmer your answer is right, but the question is wrong for it. Your answer gives an unknown number of chunks with a fixed size for each chunk (exactly as Partition, the name you have given it). But the OP wants a Split functionality that gives a fixed number of chunks with any size per chunk (hopefully of equal or close-to-equal sizes). Perhaps more suited here: stackoverflow.com/questions/3773403/…
  • Mike
    Mike over 11 years
    @nawfal - I agree. I approached it this way because it doesn't require knowing the enumerable's length ahead of time, which is more efficient for some applications. If you want a fixed number of chunks then you can call myEnum.Section(myEnum.Count() / number_of_chunks + 1). I will update my answer to reflect this.
  • nawfal
    nawfal over 11 years
    It can be done. Instead of direct casting just make the function generic and then call it for your int array
  • nawfal
    nawfal over 11 years
    @Mike, why all the hassle of LinkedList, a separate static function, etc.? Why not a straightforward approach, which I believe makes the function even faster? In the previous thread I posted, here's my answer, which is a similar implementation that is faster with no hassles of collection structures.
  • Mike
    Mike over 11 years
    @nawfal - I wanted an approach that was O(1) or "constant time." The Skip() inside your loop makes your approach O(N) because the time required to process the skip increases linearly with the number of items. Consider an enumerable with 1e12 items being separated into groups of two--the further you get into the enumerable, the slower it will go. I was also trying to avoid multiplication, which is expensive to compute relative to the other operations.
  • nawfal
    nawfal over 11 years
    @Mike have you benchmarked it? I hope you know O(1) doesn't mean faster, it only means that the time required for partitioning doesn't scale. I'm just wondering what your rationale is for sticking to O(1) when it can be slower than other O(n)'s for all real-life scenarios. I even tested it on a crazy 10^8-element list and mine still appeared to be faster. I hope you know there aren't even standard collection types that can hold 10^12 items.
  • nawfal
    nawfal over 11 years
    A few points: 1) why do you keep insisting that multiplication/modulo operations etc. are expensive? 2) myEnum.Section(myEnum.Count() / number_of_sections + 1) enumerates the first collection (and for the given question can be inefficient), so please edit the advantages you have mentioned for your approach.
  • nawfal
    nawfal over 11 years
    3) You may use a lightweight collection structure instead of LinkedList<T> as in Jon Skeet's or Brad's answers, they tend to be faster. As I said O(1) doesn't guarantee speed. The possible overhead with LinkedList outweighs its gains here (they are meant for random removal and insertion). See this link here
  • Mike
    Mike over 11 years
    @nawfal - Thank you for your detailed analysis, it helps keep me on my toes. Linked lists in general are known for efficient end-inserts, which is why I selected it here. However I just benchmarked it and indeed the List<> is much faster. I suspect this is some sort of .NET implementation detail, perhaps deserving of a separate StackOverflow question. I have modified my answer to use List<> as per your suggestion. Preallocating the list capacity guarantees that end-insertion is still O(1) and meets my original design goal. I also switched to the built-in .AsReadOnly() in .NET 4.5.
  • Mike
    Mike over 11 years
    @nawfal - At the machine code level, mult/div/mod are generally much more expensive (by an order of magnitude) than simple operations such as set/get/compare. Of course it varies by processor, compiler, and implementation, so you are indeed wise to benchmark your specific case if this sectioning operation is creating a bottleneck.
  • toong
    toong almost 11 years
    @ShadowChaser I think Reset() should throw a NotSupportedException and everything would be fine. From the MSDN documentation: "The Reset method is provided for COM interoperability. It does not necessarily need to be implemented; instead, the implementer can simply throw a NotSupportedException."
  • ShadowChaser
    ShadowChaser almost 11 years
    @toong Wow, you're right. Not sure how I missed that after all this time.
  • ironic
    ironic over 10 years
    Sorry, this may be a dumb question, but why do we need .AsReadOnly() call here?
  • Mike
    Mike over 10 years
    Technically you don't need it but it's kind of a "best practice" since we're returning an IEnumerable<> which implies that it is immutable.
  • piedar
    piedar over 10 years
    I have an irrational dislike of SQL-style Linq, so this is my favorite answer.
  • CodeFox
    CodeFox over 10 years
    @Jon, I just wonder why you prefer the ReadOnlyCollection to a read-only array in this case?
  • CodeFox
    CodeFox about 10 years
    @Jon: Ah - good point! I have had the fixed length of arrays in mind, but of course - in contrast to the ReadOnlyCollection - the elements could be replaced. Thanks for your reply.
  • Frank Tzanabetis
    Frank Tzanabetis about 10 years
    @Gishu - Indeed - it's exactly what i needed too!
  • Dejan
    Dejan about 10 years
    The problem with this approach is that it traverses the complete source list (requiring to have the whole list in-memory) before moving on. Does anyone know a better solution for large enumerables?
  • Jon Skeet
    Jon Skeet about 10 years
    @Dejan: No it doesn't. Note the use of yield return. It requires one batch to be in memory at a time, but that's all.
  • Dejan
    Dejan about 10 years
    Yes, I've seen the yield return and of course my comment "needs to traverse the complete list" is plainly wrong. Sorry. What happened was this: I was trying to use the method in a Parallel.ForEach in order to partition my data, which exploded in memory usage. I guess that the TPL spawns too many Tasks because the yield return only comes after filling a partition and so the first phase will be just preparing partitions in parallel. I guess what I need is a proper implementation of msdn.microsoft.com/en-us/library/dd381768(VS.100).aspx.
  • Jon Skeet
    Jon Skeet about 10 years
    @Dejan: Right - I wouldn't like to guess about how it interacts with Parallel LINQ partitioning, to be honest :)
  • SalientBrain
    SalientBrain over 9 years
    This is not working at all! The best possible solution is here: stackoverflow.com/questions/13709626/… See comments.
  • SalientBrain
    SalientBrain over 9 years
    It is buggy! I don't remember exactly, but (as far as I remember) it performs an unwanted step that can lead to ugly side effects (with a DataReader, for example). The best solution is here (Jeppe Stig Nielsen): stackoverflow.com/questions/13709626/…
  • SalientBrain
    SalientBrain over 9 years
    Buffering is a disadvantage if you have big chunks. I think the best solution is here (Jeppe Stig Nielsen): stackoverflow.com/questions/13709626/…
  • Jon Skeet
    Jon Skeet over 9 years
    @SalientBrain: Um, Jeppe Stig Nielsen's solution still uses batching - note the call to ToList in yield return GetNextBatch(enumerator).ToList();. Batching is necessary as otherwise you end up with very odd results if you skip one partition.
  • SalientBrain
    SalientBrain over 9 years
    @Jon Skeet: Yep, I missed that. So no magic; same implementation in other words. Thanks.
  • Lucas Rodrigues Sena
    Lucas Rodrigues Sena over 9 years
    Mine does not work; I just get one split with count 1 and all my items inside this one position. Any idea?
  • Muhammad Hasan Khan
    Muhammad Hasan Khan over 9 years
    @LucasRodriguesSena This method asks you for the no. of splits, not how much you want in each split. You can get the no. of splits by dividing the total no. of items by your desired size per split.
  • drzaus
    drzaus almost 9 years
    @Brad it will "fail" as you expect, similar to some of the issues in this thread stackoverflow.com/questions/419019/… (specifically stackoverflow.com/a/20953521/1037948)
  • Alex
    Alex almost 9 years
    .AsEnumerable() is not necessary, IGrouping<T> is already an IEnumerable<T>.
  • Muhammad Hasan Khan
    Muhammad Hasan Khan almost 9 years
    @Alex If you return part directly, the return type would be IEnumerable<IGrouping<T>> instead of IEnumerable<IEnumerable<T>>.
  • Alex
    Alex almost 9 years
    @HasanKhan I didn't run the code but I tried and it compiles just fine. I think since the return type is explicitly IEnumerable<IEnumerable<T>> an implicit cast is taking place when you return part. No?
  • Muhammad Hasan Khan
    Muhammad Hasan Khan almost 9 years
    @Alex that's right. When I wrote it, covariance hadn't been introduced in C#.
  • Alex
    Alex almost 9 years
    @HasanKhan no harm done. Being explicit in code is usually a good thing.
  • Jake Drew
    Jake Drew almost 9 years
    I think you can just change i.Index / partitionSize to i.Index % partitionSize and get the requested result. I also prefer this over the accepted answer as it is more compact and readable.
  • Karthik Arthik
    Karthik Arthik over 8 years
    @manu08, I have tried your code. I have a list var dept = {1,2,3,4,5}. After splitting, the result is dept1 = {1,3,5} and dept2 = {2,4} where parts = 2. But the result I need is dept1 = {1,2,3} and dept2 = {4,5}.
  • goodeye
    goodeye over 7 years
    I had the same issue with modulo, so I calculated the column length with int columnLength = (int)Math.Ceiling((decimal)(list.Count()) / parts); then did division with .GroupBy(x => x.index / columnLength). One downside is Count() enumerates the list.
  • MBoros
    MBoros about 7 years
    Next time try: var nrs = Enumerable.Range(1,2000).ToList();
  • chinookf
    chinookf almost 7 years
    For VB.NET, I had difficulty with query syntax due to the lack of i++ in VB (and struggled with the Into keyword), so I used method syntax instead:

        Dim i As Integer = 0
        Dim splits As IEnumerable(Of IEnumerable(Of T)) =
            list.GroupBy(Function(item)
                             Dim groupRemainder As Integer = i Mod parts
                             i += 1
                             Return groupRemainder
                         End Function).Select(Function(group) group.AsEnumerable())
  • Edward Brey
    Edward Brey about 3 years
    This method (and others like it) relies on the list size being larger than the number of parts it's being split into.
  • claudekennilol
    claudekennilol over 2 years
    I see that this was in an RC for .net 6, but it doesn't look like this actually made it.
  • Paul Hodgson
    Paul Hodgson over 2 years
    My preferred answer as it maintains the order of the list. I had to make a modification to distribute the "extra" chunks uniformly throughout the data set, though.
  • Alex Dresko
    Alex Dresko over 2 years
    This code does not do the same thing as .NET 6 Chunk. If you call .NET 6 Chunk with 200 items, and a chunk value of 100, you will get 2 chunks. If you do the same thing with the code above, you'll get 100 chunks, each with 2 items.