The core of the newest suite of components that we are building is an authentication and authorization service. Its goal is simple: provide a central location for identifying clients and the data sets and users that they own, for the purposes of using remote, cloud-based resources. It forms part of a greater strategy of helping to get data that was previously client-side only into a location where it can be used by mobile applications, websites and third party integrations.
This service also includes the management of application keys, allowing us to control who is able to access the cloud components, and uniquely identifying where requests are coming from. Typically we generate a unique application for each piece of software that wants to interact with the system, including third party software and our own.
Most of our services up to now have been built using the Nancy framework. The main reason for this is that it's simple and easy to understand, and gives a lot of control over the entire pipeline, which is important for a number of reasons (one of which is logging).
Unfortunately, when it came time to build the authentication and authorization service, we were unable to use Nancy. Instead we had to use Web API. Now, don’t get me wrong, Web API is a fine framework, but the majority of our experience lay with Nancy, so we might not have put together the best Web API implementation. We were forced to use Web API because the component we were going to use for the core of the authentication (the ASP.NET Identity Framework) integrated easily with Web API, but didn’t seem to have any obvious way to integrate with Nancy (when hosted in ASP.NET).
Really, all of the above is just background to the problem that I’m currently struggling with.
Leaking connections to the persistence store.
A Nice Deep Pool
The backend of our authentication and authorization service is a PostgreSQL database hosted in AWS using RDS. We chose PostgreSQL because SQL Server is generally too expensive (even hosted through AWS) and because there was a PostgreSQL provider for Entity Framework that seemed solid (Npgsql). PostgreSQL is also an extremely mature database engine, and is just far enough out of our comfort zone to provide some new learning opportunities.
Because of the way the Identity Framework uses Entity Framework, all we had to do was create the DbContext using the appropriate connection/provider and EF would take care of the rest, seamlessly persisting our data into the appropriate place.
To be honest, it was actually pretty seamless. We didn’t have any issues (that I can remember anyway) and everything seemed to go pretty smoothly. The service wasn't being actively used in a production environment (because we were still developing the solution it was built for), but it was getting regular traffic and handling it admirably.
Then we did some load testing and it crashed and burnt.
Investigation showed that the service stopped working because it could no longer open new connections to the database, as the database already had too many connections currently in use. We were only using a db.t2.micro RDS instance, and it has a cap on the number of connections (approximately 26), so that made sense. As soon as there was a greater number of requests trying to access the database than the available number of connections, requests would fail.
A closer look showed that the database connections being used for EF did not have pooling enabled. The pooling supported in Npgsql is an entirely client-side structure, allowing for efficient reuse of connections by avoiding the work that goes into establishing the pipeline. It also offers the ability to queue requests for additional connections once the configured pool limit has been reached. Obviously there is a timeout (to prevent deadlock), but this allowed us to set the connection pool limit close to the RDS limit, safe in the knowledge that subsequent requests would simply be slower, rather than failing altogether.
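To make the "queue, then time out" behaviour concrete, here is a minimal sketch of a bounded client-side pool. It is written in Python purely for illustration and is not how Npgsql implements pooling; the names BoundedPool, PoolTimeout, factory, max_size and timeout are all hypothetical.

```python
import queue
import threading


class PoolTimeout(Exception):
    """Raised when no connection slot frees up within the timeout."""


class BoundedPool:
    """Hypothetical bounded pool: reuse idle connections, cap the total,
    and make callers wait (up to a timeout) once the cap is reached."""

    def __init__(self, factory, max_size, timeout):
        self._factory = factory            # callable that opens a new connection
        self._timeout = timeout            # seconds a caller may wait for a slot
        self._idle = queue.LifoQueue()     # connections ready for reuse
        self._slots = threading.Semaphore(max_size)  # one permit per allowed connection

    def acquire(self):
        # Block until a slot is free, or give up after the timeout.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolTimeout("no connection available within the timeout")
        try:
            return self._idle.get_nowait()  # reuse an idle connection if there is one
        except queue.Empty:
            try:
                return self._factory()      # otherwise open a new one
            except Exception:
                self._slots.release()       # do not lose the slot if opening fails
                raise

    def release(self, conn):
        self._idle.put(conn)                # make the connection reusable first
        self._slots.release()               # then wake up one waiting caller


# Usage: a cap of 2, with a plain object standing in for a real connection.
pool = BoundedPool(factory=object, max_size=2, timeout=0.05)
first = pool.acquire()
second = pool.acquire()
try:
    pool.acquire()                          # third caller waits, then times out
    timed_out = False
except PoolTimeout:
    timed_out = True
pool.release(first)
reused = pool.acquire()                     # gets the released connection back
```

The point of the design is that the cap lives on the client: with the pool limit set just under the database's own limit, extra callers wait for a slot instead of being rejected by the server.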
With pooling enabled, the load tests passed with flying colours. I knew that we would have to scale up the RDS instance based on real traffic, but that could happen later. The pooling allowed the service to make better use of limited resources.
For now, problem solved.
These Are My Connections! Mine!
After running happily for many weeks, the staging service went down hard. No load tests involved this time, just normal traffic. Shortly afterwards (within days), a newly provisioned production server did the same thing.
Looking at the logs, the symptoms seemed to be similar: errors related to being unable to obtain a connection because the server did not have any available. There was no real sign that the service was doing anything particularly difficult that would require it to hold the connections for an extended amount of time, so we were a bit confused.
Given that the service had been running happily for a number of weeks, it must have been something we changed recently. Unfortunately, the only recent change made was a fix to a threading issue with Entity Framework, Owin and the Identity Framework (two threads attempting to lazily initialize the same DbContext at the same time). It was a fairly trivial fix: instead of returning the DbContext from the factory straightaway, make sure that the model has been initialized first.
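The general shape of that fix (force the shared lazy initialization inside the factory, before handing the object out) can be sketched as follows. This is a Python illustration of the pattern only, not the Entity Framework code; Context, ensure_model and create_context are hypothetical stand-ins.

```python
import threading


class Context:
    """Hypothetical stand-in for a DbContext whose shared model is built lazily."""

    model_builds = 0                  # how many times the model was actually built
    _model = None
    _model_lock = threading.Lock()

    @classmethod
    def ensure_model(cls):
        # Only one thread at a time may check and build the shared model.
        with cls._model_lock:
            if cls._model is None:
                cls.model_builds += 1
                cls._model = {"initialized": True}
        return cls._model


def create_context():
    """Factory: make sure the model exists before returning the context."""
    ctx = Context()
    Context.ensure_model()
    return ctx


# Usage: several threads ask the factory for a context at the same time.
threads = [threading.Thread(target=create_context) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many threads race into the factory, the model is built once, and no caller receives a context whose model is still half initialized.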
It was time to dig into it in more depth.
My hypothesis was that if the application was leaking connections to the database (such that they were hanging around but no longer in a usable state), then I should be able to see them when analysing a memory dump. I started up the Auth API locally using the SciTech .NET Memory Profiler, took an initial snapshot before doing anything meaningful, and then started the stress tests.
Every 15 minutes I would take another snapshot of the memory space, specifically looking for any DbContext or NpgsqlConnection objects that were hanging around for longer than they should have been.
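The snapshot comparison boils down to counting live objects by type at two points in time and looking at what grew. The sketch below shows that idea using Python's gc module rather than the .NET profiler; snapshot, growth and FakeConnection are hypothetical names.

```python
import gc
from collections import Counter


def snapshot():
    """Count live, GC-tracked objects by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, names):
    """How many more instances of each named type exist in the later snapshot."""
    return {name: after[name] - before[name] for name in names}


class FakeConnection:
    """Hypothetical stand-in for a connection object."""


leaked = []                                  # something that keeps instances alive
before = snapshot()
leaked.extend(FakeConnection() for _ in range(5))
after = snapshot()
delta = growth(before, after, ["FakeConnection"])
```

A type whose count keeps climbing from snapshot to snapshot under steady load is the one to investigate.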
What I found was interesting.
Slowly but surely, a collection of NpgsqlConnection objects was accumulating. They were disposed, but could not be garbage collected because they were being rooted by an event handler inside the Npgsql connection pooling implementation. I’m not actually sure if they were holding on to server side resources (being disposed, you would think that they wouldn’t be), but it didn’t look promising.
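For anyone who has not run into this kind of leak before, here is a small sketch of how a disposed object can stay alive because a long-lived object still holds one of its event handlers. It is Python, not Npgsql, and Pool, Connection and on_pool_event are hypothetical names; the mechanism (a handler list holding a strong reference to the subscriber) is the part that carries over.

```python
import gc
import weakref


class Pool:
    """Hypothetical long-lived object that keeps a list of event handlers."""

    def __init__(self):
        self.handlers = []            # strong references to subscribed callbacks


class Connection:
    """Hypothetical connection that subscribes to its pool but never unsubscribes."""

    def __init__(self, pool):
        self.disposed = False
        # The bound method keeps a strong reference to this connection.
        pool.handlers.append(self.on_pool_event)

    def on_pool_event(self):
        pass

    def dispose(self):
        self.disposed = True          # marks the object unusable, but does not unsubscribe


pool = Pool()
conn = Connection(pool)
probe = weakref.ref(conn)             # lets us check liveness without keeping it alive

conn.dispose()
del conn
gc.collect()
still_rooted = probe() is not None    # the pool's handler list still reaches it

pool.handlers.clear()                 # drop the subscription
gc.collect()
collected = probe() is None           # nothing references it any more
```

Disposing an object and making it collectable are separate things: as long as the handler stays registered, the disposed connection remains reachable.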
I turned off connection pooling and those objects stopped being accumulated. Obviously the service also stopped working properly once it reached a certain amount of traffic (when reaching the connection cap of the RDS instance), but it did at least confirm that the issue seemed to be within Npgsql itself and not in our code.
Luckily, a new version of Npgsql had been released (we were using 2.2.5, but version 3 was now available). I removed all of the references to the old package and installed the new one. After dealing with some differences (some very strange errors relating to attempting to use disposed DbContexts during Owin cleanup due to the unnecessary usage of SaveChangesAsync), everything seemed to be working just fine. All of the functional tests passed, and the stress tests didn’t show anything interesting unless I tuned them up to “crash the computer” level.
Pushed, built, deployed.
Where Have All The Connections Gone
Still leaking connections. Still fails when some critical mass of leaked connections is reached.
Come back next week for part 2, when I hopefully have an actual answer as to where the connections are going.