Automated testing in production, 5 minutes at a time

Rich Keenan

Last week we started getting alerts that some of our requests to S3 were timing out. It turned out the bucket was absolutely full of S3 delete markers under the same prefix, and that was what was causing the timeouts. The fix was simple enough (we just weren't cleaning up the delete markers), but I want to share how we discovered this was happening in the first place.
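
Mechanically, the clean-up boils down to listing the object versions under the prefix and batch-deleting the delete markers. Here's a rough sketch of that using aws-sdk-go; the bucket name and prefix are placeholders, and it skips the pagination and batching a bucket genuinely full of markers would need, so treat it as an illustration rather than our actual code.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))

	// Placeholder values, not our real bucket or prefix.
	bucket, prefix := "example-bucket", "example/prefix/"

	// List object versions; delete markers come back separately from versions.
	out, err := svc.ListObjectVersions(&s3.ListObjectVersionsInput{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Collect the delete markers and remove them in a batch request.
	// (A real clean-up would paginate and batch in groups of up to 1000.)
	var objects []*s3.ObjectIdentifier
	for _, m := range out.DeleteMarkers {
		objects = append(objects, &s3.ObjectIdentifier{Key: m.Key, VersionId: m.VersionId})
	}
	if len(objects) == 0 {
		return
	}
	if _, err := svc.DeleteObjects(&s3.DeleteObjectsInput{
		Bucket: aws.String(bucket),
		Delete: &s3.Delete{Objects: objects},
	}); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("removed %d delete markers\n", len(objects))
}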

At Countingup we have a really strong culture of testing. Our backend is built up of Go microservices; we've got something like 30 services running in a Kubernetes cluster serving up JSON web APIs for our app. We have a very special service running in this cluster that performs 'testing in production'.

That service is called 'system-verification'.

System-verification

This service has a suite of tests, written in Go, that runs every 5 minutes in all of our environments (including production) and reports the results through logs and a JSON endpoint. The tests perform comprehensive, realistic actions that our small business owners and their accountants would carry out: sign up as a new customer, issue an invoice, make a payment transaction, create journals, respond to a 3D Secure challenge and so on. If a test can't perform these actions, or if the responses don't look right, the test fails.
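
To give a feel for the shape of the runner, the scheduling is essentially a ticker loop that runs every suite and then reports the results. The sketch below is purely illustrative; Result, runAllSuites and reportResults are made-up stand-ins, not names from our codebase.

package main

import (
	"log"
	"time"
)

// Result is a hypothetical summary of one suite run.
type Result struct {
	Suite  string
	Passed bool
}

// runAllSuites and reportResults stand in for the real runner and reporting code.
func runAllSuites() []Result { return nil }

func reportResults(results []Result) {
	for _, r := range results {
		if !r.Passed {
			log.Printf("suite %s failed", r.Suite) // shipped to the log platform in reality
		}
	}
}

func main() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for {
		reportResults(runAllSuites()) // run every suite, then report failures
		<-ticker.C                    // wait for the next 5-minute tick
	}
}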

There are a couple of ways system-verification could make those requests. The simplest would be to call the services (Kubernetes pods) directly using Kubernetes' built-in DNS. That would be fast and require the fewest moving parts, but it's not actually very helpful. We're not trying to catch buggy code with system-verification; it's about continuously checking the health and correctness of our system as a whole, and there's a lot of infrastructure sitting between the app and our services: CloudFront, the Kubernetes ingress and the network path down to the database.

If system-verification talked directly to the services, we'd be missing out on that bigger picture. We wouldn't know:

  • Is CloudFront up?
  • Is the Kubernetes ingress set up correctly?
  • Is there extra latency between our services and the database?

A much better approach is to treat system-verification as a first-class consumer of our API. Have it send requests out of the cluster over the public internet to the exact URLs the app uses.
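
In practice that just means the test clients are pointed at the public base URL rather than an in-cluster service name. A minimal sketch of that idea is below; the environment variable name and fallback URL are invented for illustration, not our real configuration.

package systesting

import "os"

// apiBaseURL returns the public URL the test clients should hit, so requests
// leave the cluster and come back in through CloudFront and the ingress,
// exactly like the app's requests do. The env var and fallback are placeholders.
func apiBaseURL() string {
	if url := os.Getenv("SYSTEM_VERIFICATION_BASE_URL"); url != "" {
		return url
	}
	return "https://api.example.com" // placeholder, not the real endpoint
}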

Test suites

We've taken inspiration from Go's testing mechanism and created our own runner and test interface. The interface implements the same functions as testing.T, so we can take advantage of test assertion libraries like testify and write tests that look like familiar Go tests. Each group of related tests is bundled into a test suite; here's an example of our Invoices test suite:

p.Testsuite = systesting.NewSuite(
	"Invoices",
	[]systesting.TestFunc{
		p.Create_draft_invoice,
		p.Update_draft_invoice,
		p.Preview_invoice,
		p.Issue_invoice,
		p.Verify_issued_invoice,
		p.Email_invoice,
		p.Verify_invoice_fully_matched,
		p.Delete_match,
		p.Cancel_invoice_without_a_credit_note,
		p.Create_and_issue_second_invoice,
		p.Cancel_invoice_with_a_credit_note,
		p.Verify_issued_credit_note,
		p.Verify_credit_note_in_ledger,
	},
	p.Cleanup)
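
For context on the "same functions as testing.T" point: testify's require package only needs a type with Errorf and FailNow, so a small custom test type is enough for those assertions to work outside `go test`. The sketch below shows the general idea; the real systesting types will look different.

package systesting

import "fmt"

// T is a sketch of a custom test type. Implementing Errorf and FailNow is
// enough to satisfy testify's require.TestingT interface, which is what lets
// assertion libraries be reused outside `go test`.
type T struct {
	Name   string
	failed bool
}

// Errorf records a failure and its message.
func (t *T) Errorf(format string, args ...interface{}) {
	t.failed = true
	fmt.Printf("FAIL %s: %s\n", t.Name, fmt.Sprintf(format, args...))
}

// FailNow marks the test as failed and stops executing it; the runner recovers
// the sentinel panic to move on to the next test.
func (t *T) FailNow() {
	t.failed = true
	panic(failNowSentinel{})
}

type failNowSentinel struct{}

// TestFunc is what each entry in a suite looks like in this sketch.
type TestFunc func(t *T)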

We can also skip an entire test suite based on the failure of another suite. For example, we'd want to skip creating an invoice if we failed to sign up a user: there'd be no one to issue that invoice, so there's no point trying.
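
Continuing the sketch above, the mechanics can be as simple as the runner remembering which suites passed and checking a suite's prerequisites before running it. The prerequisites map, suite names and runner shape here are hypothetical (as the NewSuite call earlier shows, the real API is different); this is just the idea.

// Suite and runSuite are minimal stand-ins for the real runner.
type Suite struct {
	Name  string
	Tests []TestFunc
}

func runSuite(s *Suite) bool { /* run each test, recovering FailNow panics */ return true }

// prerequisites maps a suite to the suites that must have passed before it runs.
var prerequisites = map[string][]string{
	"Invoices": {"Signup"}, // don't try to invoice if sign-up failed
}

func runSuites(suites []*Suite) {
	passed := map[string]bool{}

	for _, s := range suites {
		skip := false
		for _, dep := range prerequisites[s.Name] {
			if !passed[dep] {
				skip = true
				break
			}
		}
		if skip {
			continue
		}
		passed[s.Name] = runSuite(s) // true if every test in the suite passed
	}
}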

Countingup doesn't have a dedicated QA team; the developers are fully responsible for writing the tests, so we're keen to make that as easy as possible. All our service endpoints are defined using Swagger and we auto-generate the Go client code from those definitions, so each test can perform whatever actions it needs in a type-safe, easy-to-read way. Very cool.
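
To make that concrete, an individual test ends up reading something like the sketch below. The suite type, generated client and field names are invented for illustration; only the overall shape (call the generated client, assert with testify's require, stash IDs for later tests in the suite) reflects what's described above.

// Illustrative only: "invoices" stands in for a Swagger-generated client
// package, require is github.com/stretchr/testify/require, and the field and
// method names are made up.
func (p *InvoicesSuite) Create_draft_invoice(t *systesting.T) {
	resp, err := p.client.CreateDraftInvoice(invoices.DraftInvoice{
		CustomerName: "System Verification Test Ltd",
		AmountPence:  12345,
	})
	require.NoError(t, err)                // the request must succeed
	require.Equal(t, "DRAFT", resp.Status) // and the invoice must come back as a draft

	p.draftInvoiceID = resp.ID // later tests in the suite reuse this ID
}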

We've got 14 test suites right now, each with around 10 individual tests, so the coverage is good, but we can always do better!

Cleaning things up

Obviously we don't want to leave any test data lying around in our production environment, so each test suite has a Cleanup function that runs after it. Typically the Cleanup function deletes all the data created by the suite via endpoints on each service that are only accessible to system-verification, and only for data created by system-verification. This makes it impossible to delete non-test data.
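
A Cleanup function therefore tends to be a short sequence of calls to those test-only delete endpoints, asserting that the deletes succeed. A hedged sketch, with the client method and fields invented for illustration:

// Illustrative Cleanup for the Invoices suite. DeleteTestInvoice stands in for
// a real test-only endpoint, which only accepts data created by system-verification.
func (p *InvoicesSuite) Cleanup(t *systesting.T) {
	for _, id := range p.createdInvoiceIDs {
		err := p.client.DeleteTestInvoice(id)
		require.NoError(t, err) // cleanup failures are test failures too
	}
}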

I won't go into the specifics of how we know what data is test data and what's real as we're planning a dedicated blog post for that in the future.

What happens when tests fail?

We log any test failures and these automatically get shipped to our centralised log platform, which alerts us through Slack. There's also a JSON endpoint that returns a 200 OK if the last test run passed, or a 503 (along with the names of the failed tests) if it didn't. This is surfaced on our internal admin site, so it's super easy to see what went wrong. Pingdom also checks this endpoint and alerts the technical support team.
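
The status endpoint itself is simple: 200 when the last run passed, 503 plus the failed test names when it didn't. A minimal sketch with net/http follows; the handler path, port and payload shape are illustrative rather than our actual endpoint.

package main

import (
	"encoding/json"
	"net/http"
)

// status is a hypothetical snapshot of the last test run.
type status struct {
	Passed      bool     `json:"passed"`
	FailedTests []string `json:"failedTests,omitempty"`
}

// lastRun would be updated by the runner after every 5-minute cycle.
var lastRun = status{Passed: true}

func statusHandler(w http.ResponseWriter, r *http.Request) {
	code := http.StatusOK
	if !lastRun.Passed {
		code = http.StatusServiceUnavailable // 503 is what makes Pingdom raise an alert
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(lastRun)
}

func main() {
	http.HandleFunc("/status", statusHandler)
	http.ListenAndServe(":8080", nil)
}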

Typical failures

Because system-verification exercises our real production infrastructure, we get immediate insight into problems happening both inside and outside our system, without our customers needing to let us know. Some examples:

  • AWS CloudFront was misbehaving recently and kept sending 502 responses; we raised an AWS support ticket and they took the edge server out of rotation. We wouldn't have spotted that if system-verification talked to the services internally.
  • AWS S3 bucket latency: this is what we saw last week, although it's typically a one-off because S3's amazing.
  • Downtime from some of the third-party services we use for payments, identity verification etc.

Remember, we're not trying to catch buggy code with system-verification; we've got plenty of unit and integration tests for that.

Closing 👋

Of course there are other tools and processes we use to ensure reliability, security, correctness and so on. I'm sure we'll write about them in due time, but today I wanted to focus on one particular part: system-verification.