How I managed to encounter and recover from Computer Science’s Two Hardest Problems

Back in August 2019, I managed to encounter both of computer science’s hardest problems: (1) cache invalidation and (2) naming things, while trying to do something relatively simple. I wanted to share my experience and the lessons I learned from that situation.

Problem Background

With any public facing application that experiences medium to high traffic load, you have to implement some form of caching layer. The company I was with has always been good about caching in certain instances. Back in December, I expanded our caching infrastructure for our database using PHP’s PSR-6 standard and the PHP Cache library to expand our use cases. As we’ve been developing new features or addressing issues with old features, we’ve been further expanding on that work to improve the site’s overall performance.

In this particular instance, I was implementing some caching on a fairly prominent, legacy data access method. The method in question looked like the following:

Note: I’ve omitted real variable names and entity types to avoid any IP infringements.

For those that don’t read PHP regularly and those that don’t know our codebase, this method tries to find all entities by a foreign key relationship A. If either a value for relation B or relation C is passed in, then the method attempts to filter by those values as well. If relation A is passed an array of IDs or relation B or C exist, then all the relevant results are returned. Otherwise, it tries to return the first matching value as it expects only one entity to exist.

To be clear, this function should really exist as 4 different functions, but due to its prevalence, usage in the codebase, and scope of the problem, refactoring was out of the questions.

Problem 1: Naming Things

Naming Issue 1

As previously stated, this returns only 1 element if only a single value for relation A is passed in as an argument. According to PHP docs, reset “set[s] the internal pointer of an array to its first element”. The gotcha is its return value. According to the doc, it “returns the value of the first array element, or FALSE if the array is empty.” So in PHP, anytime you see $x = reset($arr); or return reset($arr);, it’s actually behaving like a first method or return $arr[0];.

Without knowing this, when I went to update this function to work from cache, I also removed the reset and ended up with the following:

As a result of leaving out the reset, the function started returning an array in the case that previously expected a single value. This caused some issues upstream. The fix we decided on was to convert the final return statement to read return _::first($result); which was our replacement for reset to help clarify the intended behavior.

There are 2 main takeaways from this problem:

  1. Having a fully functioning testing suite would have been ideal as this problem would have been caught by CI and any future issues would have been caught.
  2. Using clearly named functions is important. reset and first have different meanings to me, but to a seasoned PHP developer, these are synonymous. Regardless, to make entry into a codebase easier, comments on lines with ambiguously named functions is a must.

Naming Issue 2

There was another issue around naming that wasn’t super obvious and we didn’t catch it until the previous naming fix had been shipped. Relation B and C are being filtered by variables $relationBId and $relationCId. Reading these variables, I thought they were always integer values. However, reading the function’s doc comment, these values can actually be an integer, an array of integers, or null. All the null cases were handled already and were easily identified, but how was this working before with arrays?

The answer lies in how PDO (PHP Data Objects) works. PDO’s where method allows for an integer, a string, or an array to be passed in, and it determines whether to translate this to a SQL = or IN clause under the hood. So when I translated:

to:

I changed the underlying behavior of the filtering which caused a second bug I needed to run a rollback. The fix for this issue was to force the ID(s) variables to always be arrays before filtering and use a _::contains checker instead. The final result looks like:

Again, having tests would have 100% helped avoid this problem, but given that we didn’t have a working test suite other measures needed to be taken. The first would have been to have RTFM’d and paid more attention to the doc comment telling me the actual behavior. However, I argue the variables should have been named in the plural, e.g. $relationBId should have been $relationBIds, thus giving any reader the required context.

Problem 2: Cache Invalidation

So far, we’ve only discussed the naming issues problem, but where does the cache invalidation come into play as a problem? In theory, deploying a cache version of a method should be safe so long as the method wasn’t previously cached. However, rolling back from a bad change and enabling and disabling the cache between deployments causes a huge cache invalidation problem.

Here’s a timeline of the actions that were taken:

  • Thursday 4PM: Deploy the cached version with the missing _::first call
  • Thursday 4:30PM: Rollback the previous change
  • Friday 9AM: Deploy the fix with the _::first
  • Friday 9:30AM: Rollback the previous change
  • Friday 9:45AM: Deploy the final fixed version

When we ran the Friday 9AM deploy, we didn’t see any issues with the cache, but when we deployed again at 9:45AM, we started seeing massive problems and error spikes.

When I deployed the change at 9AM, we started updating our cache on each write so that entities were properly written or invalidated. However, when I ran the rollback, I disabled the caching mechanisms temporarily until I could redeploy the final fix. This meant that for ~15 minutes, our cache was not being correctly updated or invalidated, so when I deployed the final fix, we were reading from a poisoned cache where some entities had been updated in that period of time but the cache was providing outdated information.

A question you might have is “why did you not see any issues when you did the 9AM deploy around the cache?” The reason is our cache was setup with a 2 hour expiry so all of the records written to cache Thursday evening had already expired and were not an issue.

We had a few takeaways from this:

  1. Whenever you have to rollback a deployed cache, you must bust it on the next deploy.
  2. A safer strategy for deploying caches is to do 2 stage deploys. In stage one, you deploy the cache write. In stage two, you enable the cache reads. This allows you to rollback the deploy cache reads without rolling back the cache writing mechanism which avoids this problem holistically. This solution took some extra code to accomplish but was well worth the effort of implementation.

Learnings & Takeaways Summary

RTFM

If someone was nice enough to leave you documentation next to your method, take the time to read it. There’s probably a good reason they did that for you.

Name variables better

This is debatable given the previous learning, but I think if you’re defining a variable that can be an array or single value, name it in the plural form to reflect all the options better. Additionally, using better named functions can lead to an easier understanding of your intent to avoid other problems.

Bust caches when redeploying a cache

Whenever you have to change a cache write method or roll it back then redeploy it, you should just go ahead and bust that cache to avoid poisoning it. In my opinion, it’s better to slow the site down for a minute while caches refresh than to try and fix any issues arising from the cache being poisoned.

Single Responsibility Functions

As previously mentioned, this function was rather bloated with all its flexibility. It would have been better to have 4 separate methods for each of the different use cases so each function could be rewritten clearly. There’s also a debate to be made that these pieces could have been abstracted into different layers as some of the intent was business logic and not database lookup logic.

Write Unit Tests

Regardless of all the other learnings, I think having a working unit test suite would have saved all the headaches as each of the different use cases should have been under test already and refactoring to use a cache could have been handled before any code was ever deployed to production thus causing the rollbacks.

Conclusion

I hope you’ve learned something from my experience and take those lessons into your development processes. Let me know in the comments of other situations like this you’ve run into in your experiences or ideas you have. Happy coding! :)

Engineering Manager at https://labs.thisdot.co/