
Aurora MySQL v2 upgrade

9 minute read

Ben Magistris

Databases are critical pieces of infrastructure, underpinning the vast majority of applications. They can do a lot of things, but essentially they act as a way of remembering information. The thing is, remembering information is so fundamental that everything ends up relying on it. Just think of Google: web indexing, maps, videos, emails, etc. all rely heavily on data storage.

At Countingup, we use Amazon's Aurora MySQL as our primary data storage and with v1 reaching end of life, some of our developers (myself included) have been busy preparing to upgrade to Aurora MySQL v2.

In this post, I cover how we use Aurora, some of the challenges we've had to address upgrading to v2, and what steps we've taken to prepare for future upgrades.

Aurora MySQL and Countingup

Countingup runs using a microservice architecture of over 30 services, each of which has its own database. Each service runs inside a Kubernetes cluster, with its database inside our Aurora MySQL cluster.

A service's database is private: it is only available to the service itself, and we enforce this by giving each service its own unique credentials to access the database server.

Our services are unaware of these clusters or the DB engine, they only care about managing their data storage via a MySQL database.
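
From a service's point of view, this just means opening an ordinary MySQL connection with its own credentials. A minimal sketch of what that might look like using Go's database/sql and the go-sql-driver/mysql driver (the environment variable names and example values are hypothetical, not our actual configuration):

package main

import (
    "database/sql"
    "fmt"
    "log"
    "os"

    _ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
    // Each service has its own user, password and schema on the shared Aurora
    // cluster, so it can only ever see its own data.
    dsn := fmt.Sprintf("%s:%s@tcp(%s:3306)/%s?parseTime=true",
        os.Getenv("DB_USER"),     // e.g. "invoices_service" (hypothetical)
        os.Getenv("DB_PASSWORD"), // unique to this service
        os.Getenv("DB_HOST"),     // the Aurora cluster endpoint
        os.Getenv("DB_NAME"),     // e.g. "invoices" (hypothetical)
    )

    db, err := sql.Open("mysql", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // To the service this is just MySQL; whether the other end is Aurora v1
    // or v2 is invisible here.
    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
}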

Both clusters provide scaling for performance and reliability, be it Pods for Kubernetes, or replicas for Aurora.

Why are we upgrading?

Aurora supports each MySQL LTS release for at least three years. v1 is scheduled for deprecation in February 2023, in line with that support window.

There are two phases to the v1 end of life:

  • From September 27, 2022, it will no longer be possible to create new Aurora MySQL v1 instances or add secondary regions
  • On February 28, 2023, any remaining Aurora MySQL v1 clusters will be automatically upgraded to v2

We plan to be on v2 before the September 27 changes for two reasons.

Firstly, adding a secondary region is a key step in performing disaster recovery should the primary region become unavailable. Losing the ability to add one would leave us far more exposed to an unplanned outage.

Secondly, not being able to create v1 instances would limit the testing we could perform, i.e. rehearsing the upgrade by running it against a clone of our cluster.

What's the plan?

AWS provides great documentation around upgrading from v1 to v2 and supports two strategies:

  • In-place - a straightforward upgrade path, first made available back in January 2021
  • Blue-green - a zero-downtime upgrade path that involves replicating from the existing (blue) cluster to a newly upgraded (green) cluster.

The main drawback of the in-place upgrade is the unavoidable downtime. How long it takes varies with a number of factors, chiefly how busy the database is. Depending on your situation, you may find any downtime unacceptable and opt for the blue-green strategy instead.

In our case, we found the in-place upgrade took around 20 minutes. We knew this because we spun up clones of our production cluster (very easy to do with RDS) and tested the in-place upgrade against them.

With a pragmatic mindset, this was more than acceptable given the added complications of blue-green: it avoids managing binlog replication between different MySQL versions and having to reconfigure our services to switch cluster endpoints at the end of it all.

What's changing?

Upgrading from Aurora v1 to v2 requires moving from MySQL 5.6 to 5.7.

| Aurora MySQL | Compatibility |
| ------------ | ------------- |
| v1           | 5.6           |
| v2           | 5.7           |
| v3           | 8.0           |

The upgrade to 5.7 is fairly straightforward, requiring only a change of database engine binary along with some configuration changes. The existing stored data remains unaffected.

However, it wouldn't be a major upgrade without some breaking changes, so let's take a look at some standouts introduced by the configuration changes.

sql_mode & strict mode

sql_mode affects the SQL syntax MySQL supports and the data validation checks it performs. It can be set at the global and session scope, and as of 5.7 more of these modes are enabled by default.

Strict SQL Mode affects the behavior of data-change statements like INSERT and UPDATE when invalid or missing values are provided. It is enabled if either of the following sql_mode values is set:

  • STRICT_ALL_TABLES
  • STRICT_TRANS_TABLES

MySQL 5.7 enables some sql_mode options by default. Some notable ones are:

  • STRICT_TRANS_TABLES - meaning strict mode is enabled by default
  • NO_ZERO_DATE - When used in conjunction with strict mode, prevents dates from being set to 0000-00-00

In later versions of MySQL, such as 8.0, some of these modes are deprecated as separate options and are expected to become part of strict mode itself. It is for reasons like this that we would rather get on top of these modes than simply toggle them back off.
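
As a rough illustration (not our production code), it is easy to check which modes a given connection is actually running with, which is handy when comparing a 5.6 instance against a 5.7 one. The DSN below is a placeholder:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    var mode string
    // On MySQL 5.7 the result typically includes STRICT_TRANS_TABLES and
    // NO_ZERO_DATE; on 5.6 the default is far more permissive.
    if err := db.QueryRow("SELECT @@SESSION.sql_mode").Scan(&mode); err != nil {
        log.Fatal(err)
    }
    fmt.Println(mode)
}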

The rest of this post will focus on the data validation checks enabled by default in MySQL 5.7, what their impact was and how we resolved them.

Go Zero Values

At Countingup, we use a microservice architecture, virtually all of which is written in Go. We use Go notably for its small footprint and built-in concurrency handling.

As such, our backend has been developed with Go-isms built in, making use of Go's zero values, including zero value times:

time.Time{} // 0001-01-01 00:00:00 +0000 UTC

There's a whole 🐇 hole involving the Go MySQL driver and the many nuances of MySQL's own date handling, but the upshot is that under the new sql_mode defaults, INSERT and UPDATE statements containing these zero value times now error.
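
To make the failure concrete, here is an illustrative sketch (the DSN, table and column names are made up): with strict mode on, MySQL 5.7 rejects Go's zero time because it falls outside the supported DATETIME range, rather than silently coercing it as 5.6's defaults would.

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/test?parseTime=true")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    var settledAt time.Time // zero value: 0001-01-01 00:00:00 +0000 UTC

    // Under STRICT_TRANS_TABLES this returns an "Incorrect datetime value"
    // error; under 5.6's permissive defaults the value would be coerced to a
    // zero date with only a warning.
    if _, err := db.Exec("INSERT INTO payments (id, settled_at) VALUES (?, ?)", 1, settledAt); err != nil {
        log.Println(err)
    }
}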

To abstract this away and bridge the gap between our Go-isms and the stricter validation of the DB, we introduced a wrapper type around time.Time which safely marshals between the two, treating zero values as NULL. Expect a deep dive into this in a later post!
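
As a flavour of the idea, here is a minimal sketch of such a wrapper (the package, name and details are illustrative rather than our actual implementation, and it assumes the driver is configured with parseTime=true so DATETIME columns scan as time.Time):

package storage

import (
    "database/sql/driver"
    "fmt"
    "time"
)

// NullableTime wraps time.Time, storing Go's zero value as SQL NULL and
// reading NULL back as the zero value.
type NullableTime struct {
    Time time.Time
}

// Value implements driver.Valuer: zero times become NULL on the way in.
func (t NullableTime) Value() (driver.Value, error) {
    if t.Time.IsZero() {
        return nil, nil
    }
    return t.Time, nil
}

// Scan implements sql.Scanner: NULL becomes the zero value on the way out.
func (t *NullableTime) Scan(src interface{}) error {
    if src == nil {
        t.Time = time.Time{}
        return nil
    }
    v, ok := src.(time.Time)
    if !ok {
        return fmt.Errorf("unsupported type %T for NullableTime", src)
    }
    t.Time = v
    return nil
}

The corresponding columns need to be nullable, so that a NULL in the database round-trips back to time.Time{} in Go.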

Multi-DB integration tests

As mentioned earlier, we have over 30 services running Countingup, each one having its own set of integration tests.

We run our integration tests using Docker, running the service, its dependencies, and the database inside containers by way of docker compose.

This works well; however, the database image we used for integration tests was pinned to MySQL 5.6.

version: '2.1'
services:
  db:
    image: "mysql:5.6"
    ...
...

With the database being upgraded, we wanted our tests to verify that our services work with MySQL 5.7 too.

So we needed to support testing against two versions: 5.6, to ensure our services remained compatible with our current cluster version, and 5.7, for after the v2 upgrade.

We achieved this by selecting the database image via an environment variable, which we could control as part of the setup for the integration test. Taking advantage of the containerised infrastructure, we can easily swap out one image for another with the services being none the wiser!

version: '2.1'
services:
  db:
    image: "mysql:${DB_MYSQL_VERSION:-5.6}"
    ...
...

This way we could run our suite of tests once against a 5.6 image and again on 5.7.

Running in CI

We use SemaphoreCI for our CI/CD pipeline. Before a service can be deployed, it needs to pass an integration test step in the pipeline. As part of the upgrade work, all service integration tests now include a Job Matrix to run the tests across MySQL versions.

  - name: Integration tests
    dependencies: [ "Build go" ]
    task:
      ...
      jobs:
        - name: Integration test
          commands:
            ...
          matrix:
            - env_var: DB_MYSQL_VERSION
              values: [ "5.6", "5.7", "8.0" ]
      ...

Each job configured this way runs in parallel on an isolated machine. We use this to test forward compatibility, as well as to catch any regressions in our current 5.6 version.

Future proofing for Aurora MySQL v3

In the previous section, you may have noticed that 8.0 was included as a value in the Job Matrix. Just as we did for the v2 upgrade, we can prepare ahead of time for a future upgrade to v3, which is MySQL 8.0-compatible.