Software Design Blog

Simple solutions to solve complex problems

10x Performance Gain: IEnumerable vs IQueryable

This post compares IEnumerable against IQuerable using an experiment to illustrate the behaviour and performance differences. Spotting a func vs an expression func filter bug is easy to miss. The caller’s syntax stays the same but it could have a 10x performance impact on your application.

Download Source Code

Setup

SQL Server 2014 was used for hosting the database. The GeoAllCountries table content was sourced from GeoNames and contains just over 10 million rows. Entity Framework 6 was used for the LINQ to SQL integration.

Predicate Function

The code below will query the GeoAllCountries table and use a filter predicate function to filter the results starting with "Aus".
        static void Main(string[] args)
        {
            var stopWatch = Stopwatch.StartNew();
            var countryNames = GetCountryNames(name => name.StartsWith("Aus"));

            foreach (var name in countryNames)
            {
                Console.WriteLine(name);
            }

            stopWatch.Stop();
            Console.WriteLine("Running time: {0}", stopWatch.Elapsed.TotalSeconds);
            Console.ReadLine();
        }

        public static IEnumerable<string> GetCountryNames(Func<string, bool> filterFunc)
        {
            using (var context = new TestDatabaseDataContext())
            {
                IQueryable<string> names = (from country in context.GeoAllCountries 
                                            select country.Name);
         
                foreach (var name in names.Where(filterFunc))
                {
                    yield return name;
                }
            }  
        }
Running time: 8.6558463
SQL Server Profiler captured the following query between the application and the database:
SELECT [t0].[Name] FROM [dbo].[GeoAllCountries] AS [t0]

Expression Predicate Function

The code below will query the GeoAllCountries table and use an expression filter predicate function to filter the results starting with "Aus".
        static void Main(string[] args)
        {
            var stopWatch = Stopwatch.StartNew();
            var countryNames = GetCountryNames(name => name.StartsWith("Aus"));

            foreach (var name in countryNames)
            {
                Console.WriteLine(name);
            }

            stopWatch.Stop();
            Console.WriteLine("Running time: {0}", stopWatch.Elapsed.TotalSeconds);
            Console.ReadLine();
        }

        public static IEnumerable<string> GetCountryNames(
                      Expression<Func<string, bool>> filterFunc)
        {
            using (var context = new TestDatabaseDataContext())
            {
                IQueryable<string> names = (from country in context.GeoAllCountries 
                                            select country.Name);
         
                foreach (var name in names.Where(filterFunc))
                {
                    yield return name;
                }
            }  
        }
Running time: 0.8633603
SQL Server Profiler captured the following query between the application and the database:
exec sp_executesql N'SELECT [t0].[Name]
FROM [dbo].[GeoAllCountries] AS [t0]
WHERE [t0].[Name] LIKE @p0',N'@p0 nvarchar(4000)',@p0=N'Aus%'
Note that the client code did not change. Adding the expression syntax around the func made a world of difference. It is pretty easy to add the predicate syntax but is just as easy to miss in a code review unless you have the fidelity to spot the issue and understand the implications.

Summary

IEnumerable executes the select query at the database and filters the data in-memory at the application layer.

IQueryable executes the select query and all of the filters at the database.

The database filtering reduced network traffic and application memory load resulting in a significant 10x performance gain.

Butcher the LINQ to SQL Resource Hog

Has your LINQ to SQL repository ever thrown a "cannot access a disposed object" exception? You can fix it by calling ToList on the LINQ query but it will impede your application’s performance and scalability.

This post covers common pitfalls and how to avoid them when dealing with unmanaged resources such as the lifecycle of a database connection in a pull-based IEnumerable repository. An investigation is made to uncover when Entity Framework and LINQ to SQL resources are disposed of and how to implement an effective solution.

Download Source Code

Setup

The following repository class will be used to model the same behaviour as an actual LINQ to SQL database repository.
    public class Model
    {
        public string Message { get; set; }
    }

    public class Repository : IDisposable
    {
        public IEnumerable<Model> Records
        {
            get
            {
                if (_disposed) throw new InvalidOperationException("Disposed");
                Console.WriteLine("Building message one");
                yield return new Model() { Message = "Message one" };
                if (_disposed) throw new InvalidOperationException("Disposed");
                Console.WriteLine("Building message two");
                yield return new Model() { Message = "Message two" };
            }
        }

        private bool _disposed = false;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }
        
        protected virtual void Dispose(bool disposing)
        {
            if (_disposed) return;
            _disposed = true;
        }
    }

LINQ to SQL: Cannot access a disposed object

Let's execute the LINQ query below to call the repository and write the results to the console.
        static void Main(string[] args)
        {
            var records = GetLinqRecords();
            foreach (var record in records)
            {
                Console.WriteLine(record);    
            }
            Console.ReadLine();
        }

        private static IEnumerable<string> GetLinqRecords()
        {
            using (var repository = new Repository())
            {
                return (from model in repository.Records select model.Message);
            }
        }

A LINQ to SQL application would raise the following exception:

An unhandled exception of type 'System.ObjectDisposedException' occurred in System.Data.Linq.dll
Additional information: Cannot access a disposed object.

LINQ to SQL: ToList

Let's execute the LINQ query below by materialising the records to a list first:
        
        static void Main(string[] args)
        {
            var records = GetLinqRecordsToList();
            foreach (var record in records)
            {
                Console.WriteLine(record);    
            }
            Console.ReadLine();
        }

        private static IEnumerable>string< GetLinqRecordsToList()
        {
            using (var repository = new Repository())
            {
                return (from model in repository.Records select model.Message).ToList();
            }
        }
Building message one
Building message two
Message one
Message two

Yield to the rescue

Let's execute the code below using yield instead:
        static void Main(string[] args)
        {
            var records = GetYieldRecords();
            foreach (var record in records)
            {
                Console.WriteLine(record);    
            }
            Console.ReadLine();
        }

        private static IEnumerable<string> GetYieldRecords()
        {
            using (var repository = new Repository())
            {
                foreach (var record in repository.Records)
                {
                    yield return record.Message;
                }
            }
        }
Building message one
Message one
Building message two
Message two

Don’t refactor your code

Let's see what happens when we run a refactored version of the code:
        static void Main(string[] args)
        {
            var records = GetRefactoredYieldRecords();
            foreach (var record in records)
            {
                Console.WriteLine(record);    
            }
            Console.ReadLine();
        }

        private static IEnumerable<string>string<string> GetRefactoredYieldRecords()
        {
            using (var repository = new Repository())
            {
                return YieldRecords(repository.Records);
            }
        }

        private static IEnumerable<string> YieldRecords(IEnumerable<Model> records)
        {
            if (records == null) throw new ArgumentNullException("records");
            foreach (var record in records)
            {
                yield return record.Message;
            }
        }

Déjà Vu. The same error occurred as seen in the LINQ to SQL example. Take a closer look at the IL produced by the compiler using a tool such as ILSpy.

In the refactored and the LINQ to SQL version, instead of returning an IEnumerable function directly, a function is returned that points to another IEnumerable function. Effectively, it is an IEnumerable within an IEnumerable. The connection lifecycle is managed in the first IEnumerable function which will be disposed once the second IEnumerable function is returned to the caller.

Keep it simple, return the IEnumerable function directly to the caller.

Yield IEnumerable vs List Building

This post describes the use of yield and compares it to building and returning a list behind an IEnumerable<T> interface.

Download Source Code

Setup

The example consists of a contact store that will allow the client to retrieve a collection of contacts.

The IStore.GetEnumerator method must return IEnumerable<T>, which is a strongly typed generic interface that describes the ability to fetch the next item in the collection.

The actual implementation of the collection can be decided by the concrete implementation. For example, the collection could consist of an array, generic list or yielded items.

       
    public interface IStore<out T>
    {
        IEnumerable<T> GetEnumerator();
    }

    public class ContactModel
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }

Calling GetEnumerator

Let's create two different stores, call the GetEnumerator on each store and evaluate the console logs to determine if there is a difference between the List Store and the Yield Store.

List Store

The code below is a common pattern I've observed during code reviews, where a list is instantiated, populated and returned once ALL of the records have been constructed.
    
    public class ContactListStore : IStore<ContactModel>
    {
        public IEnumerable<ContactModel> GetEnumerator()
        {
            var contacts = new List<ContactModel>();
            Console.WriteLine("ContactListStore: Creating contact 1");
            contacts.Add(new ContactModel() { FirstName = "Bob", LastName = "Blue" });
            Console.WriteLine("ContactListStore: Creating contact 2");
            contacts.Add(new ContactModel() { FirstName = "Jim", LastName = "Green" });
            Console.WriteLine("ContactListStore: Creating contact 3");
            contacts.Add(new ContactModel() { FirstName = "Susan", LastName = "Orange" });
            return contacts;
        }
    }

    static void Main(string[] args)
    {
        var store = new ContactListStore();
        var contacts = store.GetEnumerator();

        Console.WriteLine("Ready to iterate through the collection.");
        Console.ReadLine();
    }
ContactListStore: Creating contact 1
ContactListStore: Creating contact 2
ContactListStore: Creating contact 3
Ready to iterate through the collection.

Yield Store

The yield alternative is shown below, where each instance is returned as soon as it is produced.
    
    public class ContactYieldStore : IStore<ContactModel>
    {
        public IEnumerable<ContactModel> GetEnumerator()
        {
            Console.WriteLine("ContactYieldStore: Creating contact 1");
            yield return new ContactModel() { FirstName = "Bob", LastName = "Blue" };
            Console.WriteLine("ContactYieldStore: Creating contact 2");
            yield return new ContactModel() { FirstName = "Jim", LastName = "Green" };
            Console.WriteLine("ContactYieldStore: Creating contact 3");
            yield return new ContactModel() { FirstName = "Susan", LastName = "Orange" };
        }
    }

    static void Main(string[] args)
    {
        var store = new ContactYieldStore();
        var contacts = store.GetEnumerator();

        Console.WriteLine("Ready to iterate through the collection.");
        Console.ReadLine();
    }
Ready to iterate through the collection.
Let's call the collection again and obverse the behaviour when we fetch the first contact in the collection.
  
        static void Main(string[] args)
        {
            var store = new ContactYieldStore();
            var contacts = store.GetEnumerator();
            Console.WriteLine("Ready to iterate through the collection");
            Console.WriteLine("Hello {0}", contacts.First().FirstName);
            Console.ReadLine();
        }
Ready to iterate through the collection
ContactYieldStore: Creating contact 1
Hello Bob

Possible multiple enumeration of IEnumerable

Have you ever noticed the "possible multiple enumeration of IEnumerable" warning from ReSharper? ReSharper is warning us about a potential double handling issue, particularly for deferred execution functions such as yield and Linq. Have a look at the results produced from the code below.
  
        static void Main(string[] args)
        {
            var store = new ContactYieldStore();
            var contacts = store.GetEnumerator();
            Console.WriteLine("Ready to iterate through the collection");

            if (contacts.Any())
            {
                foreach (var contact in contacts)
                {
                    Console.WriteLine("Hello {0}", contact.FirstName);
                }
            }
            
            Console.ReadLine();
        }
Ready to iterate through the collection
ContactYieldStore: Creating contact 1
ContactYieldStore: Creating contact 1
Hello Bob
ContactYieldStore: Creating contact 2
Hello Jim
ContactYieldStore: Creating contact 3
Hello Susan

IEnumerable.ToList()

What if we have a requirement to materialize (build) the entire collection immediately? The answer is shown below.

  
        static void Main(string[] args)
        {
            var store = new ContactYieldStore();
            var contacts = store.GetEnumerator().ToList();
            Console.WriteLine("Ready to iterate through the collection");
            Console.ReadLine();
        }
ContactYieldStore: Creating contact 1
ContactYieldStore: Creating contact 2
ContactYieldStore: Creating contact 3
Ready to iterate through the collection

Calling .ToList() on IEnumerable will build the entire collection up front.

Comparison

The list implementation loaded all of the contacts immediately whereas the yield implementation provided a deferred execution solution.

In the list example, the caller doesn't have the option to defer execution. The yield approach provides greater flexibility since the caller can decide to pre-load the data or pull each record as required. A common trap to avoid is performing multiple enumerations on the same collection since yield and Linq functions will perform the same operation for each enumeration.

In practice, it is often desirable to perform the minimum amount of work needed in order to reduce the resource consumption of an application.

For example, we may have an application that processes millions of records from a database. The following benefits can be achieved when we use IEnumerable in a deferred execution pull-based model:

  • Scalability, reliability and predictability are likely to improve since the number of records does not significantly affect the application’s resource requirements.
  • Performance and responsiveness are likely to improve since processing can start immediately instead of waiting for the entire collection to be loaded first.
  • Recoverability and utilisation are likely to improve since the application can be stopped, started, interrupted or fail. Only the items in progress will be lost compared to pre-fetching all of the data where only a portion of the results was actually used.
  • Continuous processing is possible in environments where constant workload streams are added.