Just another day at Engineering

Sameer Soni
4 min read · Jan 26, 2022
Photo by Clément Hélardot on Unsplash

I work with the SRE team. We ensure production systems run smoothly, and when bugs or performance glitches are identified, we take on projects that address them and improve overall reliability and resiliency. This article is about one such project we delivered and some of the learnings we picked up along the way.

In one of our mobile app features, customers can view all the information and status of the referrals they have made. This information needs to be pulled by joining several tables, so it is a relatively heavy query.

Our SRE team suggested that since this is a read-only operation, it could be served from the read-replica database instead of the main database, which would reduce the risk of overloading the main database as customer interactions increase. Of course, the feature itself needs to be improved, but reading from the read replica offered a reliable interim solution until we overhaul the functionality altogether.

So the engineers discussed this and the SRE team decided to pick it up. They looked at the code, created a new database connection to the read replica, and switched the source for this API from the main database to the read-replica database using the new connector. The feature was quickly shipped. Sounds simple and straightforward.
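To make the change concrete, here is a minimal sketch of that switch, assuming plain SQLAlchemy under Flask; the connection strings, table and column names are illustrative, not our actual production code:

from sqlalchemy import create_engine, text

primary_engine = create_engine("postgresql://app@primary-db:5432/app")   # existing main database
replica_engine = create_engine("postgresql://app@replica-db:5432/app")   # new read-replica connector

def get_referrals(customer_id):
    # The heavy, join-based referral query now runs against the replica
    with replica_engine.connect() as conn:
        rows = conn.execute(
            text("SELECT id, status FROM referrals WHERE referrer_id = :cid"),
            {"cid": customer_id},
        )
        return [dict(row._mapping) for row in rows]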

The feature was released early in the morning, and the day started looking good when the main database showed an improvement in its CPU usage trend. We had a few 5xx responses for this API, but since they were within error budgets, the team didn't investigate further.

Around evening, we realised that the load on the read-replica database was very high as customers had started using the feature. Since the read-replica instance had less capacity, we planned to upgrade it.

During the log review, we saw that multiple requests for this API were failing with the error below.

sqlalchemy.exc.TimeoutError: QueuePool limit of size 20 overflow 10 reached

This resulted in a lot of HTTP 5xx errors.
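For context, those numbers come straight from SQLAlchemy's connection pool settings. Here is a sketch of how they map onto create_engine arguments, assuming settings similar to ours rather than our exact configuration:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app@replica-db:5432/app",
    pool_size=20,     # "QueuePool limit of size 20": connections kept in the pool
    max_overflow=10,  # "overflow 10": extra connections allowed under burst load
    pool_timeout=30,  # seconds a request waits for a free connection before
                      # raising sqlalchemy.exc.TimeoutError
)

Once all 20 + 10 = 30 connections are checked out and never returned, every new request waits pool_timeout seconds and then fails, which is what surfaced as HTTP 5xx.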

We have alarms configured in our infrastructure that trigger an alert when two HTTP requests return a 5xx response within a 5-minute window. This kind of alarm can be very dangerous, and we learnt it the hard way.

We had a situation where roughly 25 HTTP requests were failing within 5 minutes (way beyond our error budget), yet no alerts were being triggered. The reason was how threshold crossing works: infrastructure alarms fire when a configured threshold is crossed, but if the metric stays in the crossed state, you are not notified again. So despite the continued API failures, the teams were not getting notified.

But as the SQLAlchemy error above kept occurring, we knew there was something we had missed with this new feature change.

We use Python with Flask for our application, and unlike JPA or Django, where session management is taken care of automatically, here we need to ensure that after a request is completed, the session object is released and its connection is returned to the pool to serve the next request.

Here, the team missed this fact and forgot to add the statement that closes the database session after the HTTP call. Since the lower environments have only a limited number of test users, catching this bug was difficult, as a production-like load was never simulated.
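For reference, the missing piece looks roughly like the snippet below, assuming Flask with a scoped SQLAlchemy session; the actual hotfix may have differed in detail:

from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

app = Flask(__name__)
replica_engine = create_engine("postgresql://app@replica-db:5432/app")
db_session = scoped_session(sessionmaker(bind=replica_engine))

@app.teardown_appcontext
def shutdown_session(exception=None):
    # Runs after every request: releases the session so its connection
    # goes back to the QueuePool instead of staying checked out
    db_session.remove()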

Only when the feature went live on production and the database connections were initialised did the problem show: a few API calls succeeded, but many started failing because connections were never returned to the pool.

After an hour of code review and risk assessment of various options (rollback was one of them), the team decided to release a hotfix the same evening. Traffic was increasing, and unless the code was changed to return connections to the pool, many more customers would face this issue. So the team created a hotfix branch, pushed it to the VCS, and the Jenkins pipeline built a new Docker image ready for deployment within a few minutes.

By now it was already rush hour for the business, and though we could have updated the services by triggering the production release pipelines, that would have updated all the other services too, since the same Docker image is used by multiple services. In principle that would have been the cleaner way, but it carried the risk of introducing other issues. Our objective was to update just one of the 20+ services running as part of the Docker swarm.

So we executed the command below to achieve this:

docker service update --force --image org/service:hotfix_tag service_name

This ensured that only that particular service was updated while the others remained untouched. Soon after execution, the SQLAlchemy errors disappeared, and the CPU load on the read-replica database also dropped.

The 5xx errors were finally down to zero, and the team took a deep breath of relief.

Key takeaways:

  1. Review your alarms, so that alerts keep being triggered for as long as systems are under stress.
  2. Handle database session related code changes with utmost care, as these are among the most difficult issues to troubleshoot.

Thanks for reading, and if you also have an interesting story to share, you can reach out to me on LinkedIn or Twitter.

Happy Coding.
