Skip(), Take() and the poor performance they bring

LINQ's Skip().Take() is great when you want to extract a subset of a data source. A typical use case is adding paging to data tables, where users can choose how many elements they want displayed at a time.
```csharp
var collection = Enumerable.Range(0, 5); // 0, 1, 2, 3, 4
var chunk = collection.Skip(2).Take(2).ToList(); // 2, 3
```
Skip() and Take() are implemented as extension methods on IEnumerable&lt;T&gt;. Most data structures that implement IEnumerable&lt;T&gt; do not offer indexed access; that is limited to structures such as List and Array. The implementation of Skip() therefore cannot jump to position N directly. Instead, it repeatedly calls IEnumerable.GetEnumerator().MoveNext() until it reaches the first element that should be included in the result set: to reach element N, it must step past all N preceding elements. The table below shows run times for splitting a dataset of N integers on my local computer. The runtime increases drastically with the number of elements.
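To make the cost model concrete, here is a simplified sketch of how Skip() behaves on a plain IEnumerable&lt;T&gt;. This is not the actual BCL implementation, just an illustration of the enumerator-stepping behavior described above:

```csharp
using System.Collections.Generic;

public static class SkipSketchExtensions
{
    // Simplified sketch: discard the first `count` elements one
    // MoveNext() call at a time, then yield whatever remains.
    public static IEnumerable<T> SkipSketch<T>(this IEnumerable<T> source, int count)
    {
        using var e = source.GetEnumerator();
        while (count > 0 && e.MoveNext())
        {
            count--;
        }
        while (e.MoveNext())
        {
            yield return e.Current;
        }
    }
}
```

Every skipped element still costs one MoveNext() call, which is why the run times below grow with the size of the input rather than the size of the requested chunk.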
| Number of elements | Skip().Take() |
|---|---|
| 25 | 3ms |
| 250 | 5ms |
| 2500 | 12ms |
| 25000 | 324ms |
| 250000 | 18284ms |
How can we improve? With indexed access, of course.
The following code snippet shows an extension method on IEnumerable&lt;T&gt; that uses the source directly if it is already an IList&lt;T&gt;, and otherwise converts it to an array, then reads elements by index. There is no longer any need to walk through all preceding elements with GetEnumerator().MoveNext().
```csharp
public static IEnumerable<T> ChunkOptimizedForLargeDatasets<T>(
    this IEnumerable<T> source, int skip, int take)
{
    var list = source as IList<T> ?? source.ToArray();
    var upperLimit = Math.Min(list.Count, skip + take);
    for (var i = skip; i < upperLimit; i++)
    {
        yield return list[i];
    }
}
```
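As a usage sketch, paging with this method could look like the following (the variable names and page sizes are illustrative, not part of the original benchmark):

```csharp
// One million elements: 0 .. 999999. Enumerable.Range is not an
// IList<int>, so the extension method materializes it into an array once.
var source = Enumerable.Range(0, 1_000_000);

// Fetch elements 300..399, i.e. page 3 with a page size of 100.
var page = source.ChunkOptimizedForLargeDatasets(skip: 300, take: 100).ToList();

Console.WriteLine(page.First()); // 300
Console.WriteLine(page.Count);   // 100
```

Note that because the method is lazy (it uses `yield return`), the ToArray() conversion only happens once enumeration starts.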
Let's compare the run times with our custom implementation.
| Number of elements | Skip().Take() | Custom |
|---|---|---|
| 25 | 3ms | 2ms |
| 250 | 5ms | 1ms |
| 2500 | 12ms | 1ms |
| 25000 | 324ms | 2ms |
| 250000 | 18284ms | 16ms |
When to use?
Given the modest performance gain for smaller datasets, it only makes sense to introduce a custom implementation in projects operating on very large in-memory datasets. For those projects, this implementation can be used wherever the one-time cost of converting an IEnumerable to an indexable data structure is small compared to the gains from indexed access. Also note that newer .NET runtimes special-case indexable sources in Skip() and Take(), so measure on your own runtime before adopting a custom implementation.