The products I work with most, ScriptRunner for Jira Cloud and ScriptRunner for Confluence Cloud, both run on the same set of AWS infrastructure.
The public-facing web services handle ~19 million inbound requests per day and make at least 120 million outbound requests to Atlassian on behalf of our customers. We decided to move them out of a “legacy” shared AWS account that was running a number of different workloads for different products, and into their own set of AWS accounts: one for each of the three environments we deploy into (dev, staging, production). Downtime wasn’t an option we could realistically consider.
We needed to move workloads running on ECS and Lambda, and replicate hundreds of GBs of data in DynamoDB and S3, all while the products were being actively used by customers.
We decided that the new service hosting infrastructure would run behind new URLs, and therefore new DNS records. This isolated the new AWS accounts better and gave us more control: we could manage the traffic migration ourselves instead of relying on a single “big bang” DNS change.
The new infrastructure would differ from the original in one major way: the entry point for all requests would be a new routing service with feature-flagged routing logic. With the feature flag off, traffic would be sent back to the old/legacy AWS account.
This was the first piece of infrastructure we stood up and tested in the new AWS accounts, and it was handling production traffic before anything else in those accounts was ready. Although we know that products like nginx and HAProxy are designed for exactly this kind of problem, we didn’t have experience running them. We have run a battle-tested Ratpack-based proxying service for outbound traffic for several years, so we cloned, simplified, and deployed that instead.
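To give a flavour of what the router does, here is a stripped-down Ratpack handler that picks a target account per request based on a feature flag. This is a sketch rather than our actual code: the tenant header, host names, and flag lookup are made-up placeholders, and the real service also copies request headers and bodies across.

```java
import ratpack.http.client.HttpClient;
import ratpack.server.RatpackServer;

import java.net.URI;

public class Router {

  // Hypothetical flag lookup; the real router asks our feature-flag service
  // whether this tenant has been migrated yet.
  static boolean routeToNewAccount(String tenantId) {
    return false; // off by default: keep sending traffic to the legacy account
  }

  public static void main(String... args) throws Exception {
    RatpackServer.start(server -> server
        .handlers(chain -> chain.all(ctx -> {
          // "X-Tenant-Id" and the two hosts below are placeholders.
          String tenantId = ctx.getRequest().getHeaders().get("X-Tenant-Id");
          String targetHost = routeToNewAccount(tenantId)
              ? "https://service.new-account.example.com"
              : "https://service.legacy-account.example.com";
          URI target = URI.create(targetHost + ctx.getRequest().getRawUri());

          // Forward the request and stream the upstream response straight back
          // to the caller; the real service also copies headers and bodies.
          ctx.get(HttpClient.class)
              .requestStream(target, spec -> spec.method(ctx.getRequest().getMethod().getName()))
              .then(upstream -> upstream.forwardTo(ctx.getResponse()));
        })));
  }
}
```

Because the flag is evaluated on every request, we could migrate traffic per customer rather than flipping everyone at once.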
Once we had the router in place, we released a change to our customers that moved them all over to the new URLs, where all traffic was then routed straight back to the original AWS account. The URL change took ~24 hours to roll out because of the way Atlassian pushes updates to Cloud customers, which let us monitor and validate the behaviour of the routing services as the request load scaled up.
While we implemented the routing setup just described, other members of the team reconfigured our build pipelines in both Bamboo and Bitbucket Pipelines to deploy all of our existing infrastructure into the new AWS accounts. This covered several Terraform repos for the supporting/shared infrastructure (load balancers, caches, DynamoDB tables, the workloads on ECS) plus several other repos using AWS SAM for our Lambda-based services. We also had to change some services so they checked the same feature flag the router was using, to ensure that scheduled business logic did not yet run in the new infrastructure. Without that guard, some scripts and background processes would have started running in both sets of infrastructure at the same time.
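The guard itself is simple. Something along these lines, where the flag lookup and the schedule are illustrative rather than our actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledJobs {

  // Hypothetical flag lookup; the real check uses the same feature flag the
  // router consults, so the work only ever runs in the account that owns the traffic.
  static boolean newInfrastructureIsLive() {
    return false;
  }

  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      if (!newInfrastructureIsLive()) {
        return; // the legacy account still owns this work; do nothing here
      }
      runScheduledBusinessLogic();
    }, 0, 5, TimeUnit.MINUTES);
  }

  static void runScheduledBusinessLogic() {
    // ... the scheduled scripts and background processing ...
  }
}
```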
For the live replication of customers’ data we initially followed an AWS blog post on cross-account DynamoDB replication, but the default configuration was expensive and slow: we racked up a $12k bill over a weekend. We could tolerate a decent amount of eventual consistency in the writes reaching the new AWS account, so “all” we had to do was reimplement a more cost- and performance-efficient approach and leave it running.
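I won’t go through the replacement in detail here, but one way to picture the general shape (not necessarily exactly what we run) is a DynamoDB Streams trigger that copies each change into the table in the new account as it happens. A rough sketch, with the table name, attribute handling, and error handling all simplified:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

import java.util.HashMap;
import java.util.Map;

public class ReplicationHandler implements RequestHandler<DynamodbEvent, Void> {

  // Placeholder table name; the client is assumed to have cross-account
  // write access to the table in the new account (e.g. via an assumed role).
  private static final String TARGET_TABLE = "target-table-in-new-account";
  private final DynamoDbClient targetDb = DynamoDbClient.create();

  @Override
  public Void handleRequest(DynamodbEvent event, Context context) {
    event.getRecords().forEach(record -> {
      var image = record.getDynamodb().getNewImage();
      if (image == null) {
        return; // deletes would be handled separately
      }
      targetDb.putItem(b -> b.tableName(TARGET_TABLE).item(convert(image)));
    });
    return null;
  }

  // Minimal conversion from the Lambda events model to the SDK v2 model;
  // only the scalar attribute types are covered in this sketch.
  private Map<String, AttributeValue> convert(
      Map<String, com.amazonaws.services.lambda.runtime.events.models.dynamodb.AttributeValue> image) {
    Map<String, AttributeValue> item = new HashMap<>();
    image.forEach((name, value) -> {
      if (value.getS() != null) {
        item.put(name, AttributeValue.builder().s(value.getS()).build());
      } else if (value.getN() != null) {
        item.put(name, AttributeValue.builder().n(value.getN()).build());
      } else if (value.getBOOL() != null) {
        item.put(name, AttributeValue.builder().bool(value.getBOOL()).build());
      }
    });
    return item;
  }
}
```

The key point is that because we tolerated eventual consistency, the replication path could be asynchronous and right-sized for our write volume instead of the expensive defaults.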
For data stored in S3, we first tried Amazon’s built-in cross-account replication, but the replication lag was tens of seconds (peaking at around 40s), which was far more than we could tolerate. Instead, we opened up access to the S3 buckets in the new AWS accounts and changed the relevant part of our Java services so that, when the feature flag was enabled, they wrote to both sets of buckets, giving us replication with milliseconds of latency. The mistake we made here initially was forgetting to set the “bucket-owner-full-control” ACL on those cross-account writes, so the objects couldn’t be fully accessed by the target AWS account.
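In SDK terms the fix is a one-liner. A simplified version of the dual write, where the bucket names and the flag plumbing are placeholders for our real config and feature-flag client:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ObjectCannedACL;

public class DualBucketWriter {

  // Placeholder bucket names for the two accounts.
  private static final String LEGACY_BUCKET = "bucket-in-legacy-account";
  private static final String NEW_BUCKET = "bucket-in-new-account";

  private final S3Client s3 = S3Client.create();

  public void put(String key, byte[] bytes, boolean replicateToNewAccount) {
    // The write the service has always done, into the legacy account's bucket.
    s3.putObject(b -> b.bucket(LEGACY_BUCKET).key(key), RequestBody.fromBytes(bytes));

    if (replicateToNewAccount) {
      // Cross-account write: without this canned ACL the object stays owned by
      // the writing account and the target account can't fully access it.
      s3.putObject(
          b -> b.bucket(NEW_BUCKET)
              .key(key)
              .acl(ObjectCannedACL.BUCKET_OWNER_FULL_CONTROL),
          RequestBody.fromBytes(bytes));
    }
  }
}
```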
Once everything was being deployed and data was being replicated into the new AWS account in a timely fashion, we duplicated all of the automated testing that our deployment pipelines rely on for continuous deployment, so the new infrastructure was tested to the same level as the existing infrastructure. Anecdotally, we noticed that flaky tests passed more often in the new infrastructure…
Using the aforementioned feature flag, we could run tests against the old or the new infrastructure on a whim, and also test what happened when the flag was switched on or off. Some of you may have noticed that we only built data replication in one direction (from the old account to the new account), so once we enabled the routing feature flag for a customer we could not turn it off again for that customer without some data loss. We had decided that building bi-directional replication between the two AWS accounts was asking for more trouble than it was worth. This raised the stakes a bit.
Once we enabled the feature flag in the router service, our new infrastructure handled the increase in production traffic really well. Since then we’ve removed the feature flag from the router (no going back now!) and have started scaling down and removing AWS resources in the old account. We’ll disable the DynamoDB data replication and remove the business logic that writes to multiple S3 buckets once we’re certain no more traffic is hitting the old account. The feature flag that prevents scheduled business logic from running will be removed once the old infrastructure no longer exists.
So there you have it: a lot of careful planning and preparation meant everything worked as expected. Lessons learnt: don’t blindly trust AWS’s recommended defaults, validate solutions before committing to them, and lean on the tech you already know to reduce unnecessary risk.