A strange bug on AWS Lambda

Originally posted at http://labs.adaptavist.com/code/2018/10/09/a-strange-bug-on-aws-lambda/

At Adaptavist we run a lot of Lambda functions to provide functionality in our Cloud apps. We use both the Java and Node.js runtimes for lambdas, and recently we came across a weird bug in production that was caused by the way that the lambda function runtime is re-used in AWS.

As part of our ScriptRunner for Jira Cloud app, we run a service (built using lambda functions) that processes some input data from our customer’s Jira instances and then PUTs the output back into our customer’s Jira instances. Part of the processing logic requires us to have a list of the fields that exist in our customer’s Jira instances in memory during the processing. This lambda function is triggered, in parallel, once per user per Jira instance, for a subset of the users in each Jira instance.

Let's say we have 10 customers and 2 users per customer who need this service. We would trigger the lambda function that does the processing 20 times in parallel. The processing usually takes less than 30 seconds, but we run the process every minute. That frequency is important.

I mentioned that we need a list of fields in memory. When we first built the service we fetched that list on demand for each item in the input that the lambda receives. Recently we started caching that list of fields, so that we would only need to fetch it once per lambda invocation.

The lambda uses a node module that is shared with our ScriptRunner frontend code, and it was inside this module that we added the caching.
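To give a feel for what this looked like (a minimal sketch only – the file and function names here are hypothetical, not our actual code), the cache was essentially a module-level variable that gets populated the first time it is needed:

```typescript
// fieldCache.ts – hypothetical sketch of the kind of module-level cache we added.
// Anything declared at module scope lives for as long as the module stays loaded.
let cachedFields: string[] | undefined;

export async function getFields(fetchFields: () => Promise<string[]>): Promise<string[]> {
    if (!cachedFields) {
        // Fetched once, the first time any caller asks for the fields...
        cachedFields = await fetchFields();
    }
    // ...and every later caller gets the same list back.
    return cachedFields;
}
```

Within a single invocation this is exactly what we wanted: one fetch, reused for every item in the input.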

Once we deployed this change, everything seemed to be fine, until we got some support requests along the lines of “the service is telling me the fields are invalid”. We checked the logs and sure enough there were lots of errors reported about fields mentioned in the lambda input being invalid – i.e. not in the list of fields we should have been fetching from Jira.

I mentioned at the start of the blog post that we use Java and Node.js runtimes. When developing our Java lambdas we had been very cognizant of the fact that a single lambda instance can have its runtime frozen once it completes, and that a second request a short time later can trigger the runtime to be defrosted/thawed in order to process that second request and avoid a cold start. We therefore made sure to clean up after ourselves nicely in case the JVM was re-used.

For our Node.js lambdas we hadn’t really thought that much about it, but once we saw this bug in production I had a hunch what might be happening.

Because we run the service so frequently (every minute) the lambda function is always “hot”, which means the Node.js runtime is shared between invocations of the function. We had introduced caching into a node module, and that module was not being re-initialized because the runtime (and therefore any data stored inside modules) was being kept alive between invocations! Because we only have one lambda function for this process, the first time it ran it cached the field list from one Jira instance, and all the subsequent invocations that were supposed to be for other Jira instances just used the cached data, which didn't match their input!
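One way to guard against this (a sketch only – the event shape and the fetchFieldsFor helper are hypothetical, not our actual code) is to key the cache by Jira instance, so that a warm runtime can never hand one customer's field list to another customer's invocation:

```typescript
// handler.ts – hypothetical sketch of a per-instance cache.
import { fetchFieldsFor } from "./jiraClient"; // hypothetical Jira API wrapper

// Still module-level, so it survives between warm invocations,
// but entries are keyed by the Jira instance the event belongs to.
const fieldsByInstance = new Map<string, string[]>();

export async function handler(event: { jiraInstanceUrl: string }): Promise<void> {
    let fields = fieldsByInstance.get(event.jiraInstanceUrl);
    if (!fields) {
        fields = await fetchFieldsFor(event.jiraInstanceUrl);
        fieldsByInstance.set(event.jiraInstanceUrl, fields);
    }
    // ...validate the input against `fields` and PUT the results back...
}
```

Another option is simply to create the cache inside the handler, so it never outlives a single invocation, at the cost of refetching the field list on every run.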

So, if you cache anything in a lambda function, make sure you think about the possibility of the cached value being shared between invocations of that lambda!