Antivirus solution for transaction attachments

We strive to help our business owners maintain good bookkeeping. One aspect of this is enabling and encouraging them to capture transaction evidence (receipts) in the app.

Our customers have always been able to snap photos or upload images of their receipts and attach them to transactions. That's great for physical receipts, but what about PDFs, typically from online purchases? You could still take screenshots, and plenty of our business owners were forced to do that, but it's far from ideal.

We needed to allow business owners to upload file types other than images.

In this post, we will take you through how we integrated antivirus scanning for user-uploaded files and how it allowed us to seamlessly extend our existing transaction attachment feature in the mobile app.

Motivation

Imagine a product manager's dismay when this seemingly simple change has one major requirement - we need an antivirus solution. We handle image uploads though, so why aren't they a problem?

OWASP have a detailed view of the risks of allowing arbitrary file upload. We ran all user-uploaded images through an image processing library and, whilst not perfect, we were happy with this from a security perspective. Opening this up to PDFs, and potentially other file types in the future, wasn't acceptable without some protection.

So we added antivirus scanning to user-uploaded files.

Requirements

On the mobile side, we just need to add the option to upload a PDF in the transaction attachment UI. Pretty simple!

However, the fun comes with protecting the business owner, our admin staff, and our infrastructure from malicious files while ensuring a smooth user experience.

The sequence diagram above shows there isn't much to handling transaction attachments. A Go service will handle the request and aside from some database interaction, the service acts as a thin layer between the app and the S3 storage.

We don't expose the object directly to the mobile app (e.g. using S3 presigned URLs). All transaction attachments are proxied through our Go service, which gives us more control over when and how those objects are accessed.

So, where's the fun?

Protecting both the end users and ourselves means preventing access to any unscanned or infected files. Before we can do that we need some way of determining if a file is safe to access or not.

Having a smooth user experience means letting the business owner get on with attaching / viewing their attachments. Our antivirus solution will need to scan files in real-time and the faster it does so the better. If a business owner uploads a safe file there shouldn't be any indication that we're 'processing' it in any way.

Our Solution

We store transaction attachments as objects in an S3 bucket. S3 has a feature called Amazon S3 Event Notifications. We subscribe an SNS topic to the s3:ObjectCreated:Put event which means anything subscribed to that topic receives a message with metadata about all new objects added to the bucket. In our case, we use an SQS queue that we can poll to kick off a scan job.

The S3 event could be sent straight to SQS without the SNS topic in between, but we needed that extra layer for boring reasons outside of the scope of this blog.
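To illustrate what a consumer of that queue has to unpack, here's a minimal Go sketch. It assumes the SNS subscription uses the default (non-raw) delivery, so each SQS message body is an SNS envelope whose Message field carries the S3 event as a JSON string. ParseQueueBody and ObjectRef are illustrative names, not taken from the real service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// snsEnvelope is the subset of the SNS notification JSON we need.
type snsEnvelope struct {
	Message string `json:"Message"`
}

// s3Event is the subset of the S3 event notification we need.
type s3Event struct {
	Records []struct {
		S3 struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// ObjectRef identifies one newly created object (hypothetical type).
type ObjectRef struct {
	Bucket string
	Key    string
}

// ParseQueueBody unwraps the SNS envelope found in an SQS message body
// and returns the objects that the embedded S3 event refers to.
func ParseQueueBody(body string) ([]ObjectRef, error) {
	var env snsEnvelope
	if err := json.Unmarshal([]byte(body), &env); err != nil {
		return nil, fmt.Errorf("decode SNS envelope: %w", err)
	}
	var ev s3Event
	if err := json.Unmarshal([]byte(env.Message), &ev); err != nil {
		return nil, fmt.Errorf("decode S3 event: %w", err)
	}
	refs := make([]ObjectRef, 0, len(ev.Records))
	for _, r := range ev.Records {
		// S3 URL-encodes object keys in event notifications.
		key, err := url.QueryUnescape(r.S3.Object.Key)
		if err != nil {
			return nil, fmt.Errorf("decode object key: %w", err)
		}
		refs = append(refs, ObjectRef{Bucket: r.S3.Bucket.Name, Key: key})
	}
	return refs, nil
}

func main() {
	// A hand-written sample body, trimmed to the fields used above.
	body := `{"Type":"Notification","Message":"{\"Records\":[{\"s3\":{\"bucket\":{\"name\":\"transaction-attachments\"},\"object\":{\"key\":\"receipts/my+file.pdf\"}}}]}"}`
	refs, err := ParseQueueBody(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ref := range refs {
		fmt.Printf("scan s3://%s/%s\n", ref.Bucket, ref.Key)
	}
}
```

Decoding the key matters because a file name containing spaces or other special characters arrives encoded in the event, and using it as-is would point at an object that doesn't exist.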

We have a Kubernetes pod running in our EKS cluster which:

Polls the SQS queue and contains the business logic for what to do when a file is found to be clean/infected.
Runs Clamd and Freshclam for the actual scanning and virus definitions management.

If an object is scanned and found to be infected we move it to a shared "quarantine" bucket. This diagram shows the parts of the system.

Terraform

Some of the Terraform has been simplified / renamed to make it easier to understand the core principles. Take care to check all defaults are appropriate for your use case.

We have an s3_antivirus module in our Terraform code that's used in all of our environments.

To enable antivirus scanning on a bucket we simply add the name of the bucket to our enabled_buckets list.

module "av" {
  source          = "../modules/s3_antivirus"
  enabled_buckets = [
    "transaction-attachments",
  ]
}

This makes it very obvious which buckets have antivirus scanning enabled.

We didn't enable the scanning on all of our buckets for a few reasons.

We have lots of buckets and we want to incrementally enable this feature where required. This makes it easier to monitor for issues and roll back if we need to.
Many of our buckets only contain backend-generated content. We don't need to scan these buckets so it's a waste of time and would increase the queue backlog for those buckets that do need scanning.

The contents of this module involved in setting up the pipeline between S3 -> SNS -> SQS look like this.

# Get a list of all AV enabled buckets
data "aws_s3_bucket" "buckets" {
  for_each = toset(var.enabled_buckets)
  bucket   = each.key
}

# Create an SNS topic for new objects
resource "aws_sns_topic" "new_object_topic" {
  name = "s3_antivirus_object_created"
}

# For each bucket use that SNS topic as the ObjectCreated event target
resource "aws_s3_bucket_notification" "bucket_notification" {
  for_each = data.aws_s3_bucket.buckets
  bucket   = each.value.id

  topic {
    topic_arn = aws_sns_topic.new_object_topic.arn
    events    = ["s3:ObjectCreated:*"]
  }
}

# IAM policy to allow S3 to publish to SNS
data "aws_iam_policy_document" "send_to_sns" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["s3.amazonaws.com"]
    }

    actions   = ["SNS:Publish"]
    resources = [aws_sns_topic.new_object_topic.arn]

    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [for b in data.aws_s3_bucket.buckets : b.arn]
    }
  }
}

# Apply the policy to the SNS topic
resource "aws_sns_topic_policy" "default" {
  arn    = aws_sns_topic.new_object_topic.arn
  policy = data.aws_iam_policy_document.send_to_sns.json
}

# Create SQS queue
resource "aws_sqs_queue" "queue" {
  name = "s3_antivirus_object_created"
}

# IAM policy to allow SNS to SendMessage to SQS
data "aws_iam_policy_document" "send_to_sqs" {
  statement {
    effect = "Allow"

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.queue.arn]

    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_sns_topic.new_object_topic.arn]
    }
  }
}

# Apply the policy to the SQS queue
resource "aws_sqs_queue_policy" "send_to_sqs_policy" {
  queue_url = aws_sqs_queue.queue.id
  policy    = data.aws_iam_policy_document.send_to_sqs.json
}

# Create topic subscription to tell SNS to send messages to SQS
resource "aws_sns_topic_subscription" "sns_to_sqs" {
  topic_arn = aws_sns_topic.new_object_topic.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.queue.arn
}

We've omitted a dead-letter queue and redrive policy here to keep things simple. In reality, we use these mechanisms along with a database table to retry failed scans.

Antivirus Scanning Service

This is a Go service running as a pod in our Kubernetes cluster. It has 3 main responsibilities.

Poll SQS for new object messages
Stream the object body to the ClamAV processes
Interpret and action the result of the scan

The Process

An object is added to an S3 bucket
S3 sends a notification to an SNS topic which is sent to an SQS queue
Our Go service (Kubernetes pod) polls this queue for new messages
On a new message the pod grabs a handle to the S3 object body and pipes this into clamdscan's stdin. This means we don't need to download a potentially malicious file to our file system.
If the object is infected we use S3's CopyObject function to copy the original object into a dedicated 'quarantine' bucket. Nothing in our backend can access this bucket except the scanning pod, which only has write access. Again, this ensures we never download the file to our file system. We then zero out the offending object and add various S3 Object Tags: ScanResult=INFECTED, ScanTime={now}, etc.
If the object is clean we simply set the tags, ScanResult=CLEAN, ScanTime={now}, etc.
We delete the message from the SQS queue so that it isn't delivered again
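The scan-and-action steps above can be sketched in Go roughly as follows. This is a simplified illustration under stated assumptions, not the production code: Store and ScanFunc are hypothetical abstractions over the S3 calls and the ClamAV client, and memStore and markerScan are in-memory fakes that exist only to make the sketch runnable.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// ScanResult mirrors the values written to the ScanResult object tag.
type ScanResult string

const (
	Clean    ScanResult = "CLEAN"
	Infected ScanResult = "INFECTED"
)

// Store is a hypothetical wrapper around the S3 calls named in the steps.
type Store interface {
	Open(bucket, key string) (io.ReadCloser, error) // stream the object body
	CopyToQuarantine(bucket, key string) error      // server-side CopyObject
	ZeroOut(bucket, key string) error               // overwrite with an empty body
	SetTags(bucket, key string, tags map[string]string) error
}

// ScanFunc is a hypothetical wrapper around the antivirus engine.
type ScanFunc func(body io.Reader) (ScanResult, error)

// HandleObject scans one object and actions the result. The caller should
// acknowledge (delete) the queue message only when it returns a nil error.
func HandleObject(s Store, scan ScanFunc, bucket, key, now string) (ScanResult, error) {
	body, err := s.Open(bucket, key)
	if err != nil {
		return "", fmt.Errorf("open object: %w", err)
	}
	defer body.Close()
	result, err := scan(body)
	if err != nil {
		return "", fmt.Errorf("scan object: %w", err)
	}
	if result == Infected {
		if err := s.CopyToQuarantine(bucket, key); err != nil {
			return "", fmt.Errorf("quarantine object: %w", err)
		}
		if err := s.ZeroOut(bucket, key); err != nil {
			return "", fmt.Errorf("zero out object: %w", err)
		}
	}
	// Tag last: overwriting an object in S3 replaces its existing tag set.
	tags := map[string]string{"ScanResult": string(result), "ScanTime": now}
	if err := s.SetTags(bucket, key, tags); err != nil {
		return "", fmt.Errorf("tag object: %w", err)
	}
	return result, nil
}

// memStore is an in-memory fake used only to exercise the sketch.
type memStore struct {
	objects, quarantine map[string]string
	tags                map[string]map[string]string
}

func (m *memStore) Open(bucket, key string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(m.objects[bucket+"/"+key])), nil
}

func (m *memStore) CopyToQuarantine(bucket, key string) error {
	m.quarantine[bucket+"/"+key] = m.objects[bucket+"/"+key]
	return nil
}

func (m *memStore) ZeroOut(bucket, key string) error {
	m.objects[bucket+"/"+key] = ""
	return nil
}

func (m *memStore) SetTags(bucket, key string, tags map[string]string) error {
	m.tags[bucket+"/"+key] = tags
	return nil
}

// markerScan is a fake engine: a body containing "BAD" counts as infected.
func markerScan(body io.Reader) (ScanResult, error) {
	b, err := io.ReadAll(body)
	if err != nil {
		return "", err
	}
	if strings.Contains(string(b), "BAD") {
		return Infected, nil
	}
	return Clean, nil
}

func main() {
	s := &memStore{objects: map[string]string{"bucket/receipt.pdf": "BAD payload"}, quarantine: map[string]string{}, tags: map[string]map[string]string{}}
	result, err := HandleObject(s, markerScan, "bucket", "receipt.pdf", "2024-01-01T00:00:00Z")
	fmt.Println(result, err, s.tags["bucket/receipt.pdf"])
}
```

Returning an error before the message is acknowledged means a failed scan is simply retried when the message becomes visible again.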

There are some additional checks early on in the process to decide whether an object should be scanned (e.g. no point scanning files that we know weren't user-generated / uploaded) and for handling messages in an idempotent way given SQS's "at-least-once" delivery semantics.
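One plausible way to make message handling idempotent is to look at the object's current tags before doing any work. The Go sketch below assumes exactly that, which is our guess rather than a description of the real checks: only objects still tagged ScanResult=UNSCANNED are scanned, and everything else is acknowledged without action. Decide and the Decision values are hypothetical names.

```go
package main

import "fmt"

// Decision says what the worker should do with a queue message.
type Decision string

const (
	Scan Decision = "scan"
	Skip Decision = "skip"
)

// Decide inspects an object's current tags and reports whether it still
// needs scanning. This rule is an assumption made for illustration.
func Decide(tags map[string]string) Decision {
	switch tags["ScanResult"] {
	case "UNSCANNED":
		// Freshly uploaded and not yet processed.
		return Scan
	default:
		// CLEAN or INFECTED: a duplicate delivery of a message we already
		// handled. No tag at all: not an object we expect to scan.
		return Skip
	}
}

func main() {
	fmt.Println(Decide(map[string]string{"ScanResult": "UNSCANNED"}))
	fmt.Println(Decide(map[string]string{"ScanResult": "CLEAN"}))
	fmt.Println(Decide(nil))
}
```

With a check like this, processing the same message twice is harmless: the second delivery sees a final ScanResult and is dropped.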

ClamAV 😈

ClamAV is an open source antivirus engine for detecting trojans, viruses, malware and other malicious threats. It's a mature offering and often the go-to solution for virus scanning.

ClamAV itself is actually a suite of various tools. We use three:

clamd, a daemon that loads virus definitions and listens for requests from a client to scan files. This requires a fair amount of memory to store the virus definitions but is able to scan most files in just a few milliseconds.
clamdscan, the client for clamd. We use this to stream the S3 object body into clamd. The clamdscan command returns parsable scan results and an error code indicating the result.
freshclam, a sub-process for regularly updating the virus definitions to ensure clamd checks against the latest known issues.
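As a rough Go sketch of driving clamdscan from a service: the code below pipes a reader into clamdscan's stdin and maps the exit code to a verdict. It relies on clamdscan's documented exit codes (0 for no virus found, 1 for virus found, 2 for an error) and on passing "-" to scan standard input. The function and type names are illustrative, and the flags shown are an assumption about a reasonable invocation rather than the exact command used in production.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// Verdict is the outcome of a completed scan.
type Verdict string

const (
	Clean    Verdict = "CLEAN"
	Infected Verdict = "INFECTED"
)

// ClassifyExit maps clamdscan's exit code to a verdict.
func ClassifyExit(code int) (Verdict, error) {
	switch code {
	case 0:
		return Clean, nil
	case 1:
		return Infected, nil
	default:
		return "", fmt.Errorf("clamdscan failed with exit code %d", code)
	}
}

// ScanStream pipes body into clamdscan's stdin, so the data never has to
// be written to the local file system. It returns the verdict together
// with clamdscan's trimmed standard output.
func ScanStream(ctx context.Context, body io.Reader) (Verdict, string, error) {
	cmd := exec.CommandContext(ctx, "clamdscan", "--no-summary", "-")
	cmd.Stdin = body
	out, err := cmd.Output()
	code := 0
	if err != nil {
		var exitErr *exec.ExitError
		if !errors.As(err, &exitErr) {
			// The process could not be started at all.
			return "", "", fmt.Errorf("run clamdscan: %w", err)
		}
		code = exitErr.ExitCode()
	}
	verdict, err := ClassifyExit(code)
	return verdict, strings.TrimSpace(string(out)), err
}

func main() {
	// Only the exit code mapping is exercised here, so the example runs
	// on machines without ClamAV installed.
	for _, code := range []int{0, 1, 2} {
		verdict, err := ClassifyExit(code)
		fmt.Println(code, verdict, err)
	}
}
```

Treating anything other than 0 or 1 as an error, rather than as clean, keeps a broken scanner from silently marking files as safe.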

Our antivirus service Dockerfile uses the clamav/clamav image as the base image which makes it easy to keep up-to-date with newer versions as we can monitor this using Snyk.

Object Tagging and Bucket Policies

S3 objects can have tags and when coupled with S3 bucket policies it's possible to create some very powerful but very simple access restrictions on those objects.

The bucket policy for S3 buckets where we enable antivirus scanning has the following rules:

Objects added to a bucket must contain a tag ScanResult=UNSCANNED.
Objects with ScanResult=UNSCANNED cannot be returned in a GetObject call made by a Kubernetes pod other than the antivirus pod.
Objects with ScanResult=INFECTED cannot be returned in a GetObject call made by any Kubernetes pod.

We enforce these rules through conditions that check the "aws:PrincipalArn" such that they only apply to Kubernetes pods so we don't lock ourselves out in case of a break-glass emergency.

The requirement to have the ScanResult=UNSCANNED tag lets us easily distinguish new objects from those created before we implemented the antivirus solution, when PDF uploads weren't allowed. This way we can confidently allow GetObject calls on objects with no ScanResult tag.
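The rules are easier to reason about as a small truth table. The Go sketch below models only the deny rules described here; it is not the bucket policy itself, and the Caller type and function names are made up for illustration. In a real policy, S3 offers the s3:ExistingObjectTag/<key> and s3:RequestObjectTag/<key> condition keys for this kind of check, though whether the actual policy is written with them is an assumption.

```go
package main

import "fmt"

// Caller describes who is making the request (hypothetical model).
type Caller struct {
	IsKubernetesPod bool // the rules only target pod principals
	IsAntivirusPod  bool
}

// GetDenied reports whether the rules deny a GetObject call, given the
// tags currently on the object.
func GetDenied(c Caller, objectTags map[string]string) bool {
	if !c.IsKubernetesPod {
		// Break-glass access is not restricted by these rules.
		return false
	}
	switch objectTags["ScanResult"] {
	case "INFECTED":
		return true
	case "UNSCANNED":
		return !c.IsAntivirusPod
	default:
		// CLEAN, or an older object with no ScanResult tag.
		return false
	}
}

// PutDenied reports whether the rules deny a PutObject call, given the
// tags sent with the request.
func PutDenied(c Caller, requestTags map[string]string) bool {
	return c.IsKubernetesPod && requestTags["ScanResult"] != "UNSCANNED"
}

func main() {
	api := Caller{IsKubernetesPod: true}
	av := Caller{IsKubernetesPod: true, IsAntivirusPod: true}
	unscanned := map[string]string{"ScanResult": "UNSCANNED"}
	fmt.Println("api reads unscanned, denied:", GetDenied(api, unscanned))
	fmt.Println("av reads unscanned, denied:", GetDenied(av, unscanned))
	fmt.Println("api reads untagged, denied:", GetDenied(api, nil))
}
```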

Alternative S3 Solution

We explored the idea of having a dedicated "Unscanned Objects" bucket that all new objects get written to. Only the antivirus service could read from it, and it would move clean objects into a 'target' bucket. This would limit the exposure of malicious files to a single bucket, but it would ultimately require a fairly big refactor to get all of our services to write to this bucket.

The solution we went with is almost entirely transparent as it just required us to add the ScanResult=UNSCANNED tag on new objects, which was easy to do with shared library code.

Continuous Monitoring

Hopefully malicious uploads are a very rare occurrence, which could make it hard to know that the scanning solution is actually working as intended. No news is good news? We don't want to assume everything's working (and we don't have Homer Simpson's everything is ok alarm), but we do have a continuous testing tool that we call 'system-verification' that you can read all about here.

We added a test case to the system-verification that uploads a malicious file and asserts that it was detected as infected, quarantined and isn't accessible.

We don't actually upload a malicious file. We upload a 'fake' virus known as the EICAR test file, which all antivirus software is programmed to recognise. This lets us test the system without risk of harm, and we can tune our alerting to check for this.
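For reference, the EICAR test file is a fixed 68-byte ASCII string. A test fixture for it could be built in Go along these lines. Splitting the string into two halves is a common precaution so that the source file itself isn't flagged by a scanner; the function name is ours and nothing here is taken from the real system-verification test.

```go
package main

import "fmt"

// EICAR returns the standard 68-byte antivirus test string. It is built
// from two halves so that this source file does not contain the complete
// signature. Raw string literals keep the backslash as-is.
func EICAR() []byte {
	const head = `X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR`
	const tail = `-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*`
	return []byte(head + tail)
}

func main() {
	payload := EICAR()
	fmt.Println("test payload length:", len(payload))
}
```

A verification test would upload this payload like any other attachment and then assert the outcome described above: the object is tagged as infected, a copy lands in quarantine, and reads of the original are refused.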

Alternative Solutions

There were a few different approaches we could have taken here. At Countingup we work iteratively and let data inform our decisions throughout a project, and we're happy to pivot if a solution isn't appropriate. This project was no exception.

Stateless Scanning with AWS Lambda

There is a lot of prior art on the internet around building antivirus solutions with AWS Lambda and you may be tempted, as we were, to follow suit.

At the start of this project, we explored using a simpler solution to trigger a Lambda function directly from S3. We would still use ClamAV, but instead of clamd, we would use its one-off scanning tool clamscan.

The benefits here are:

Simple pipeline, only consisting of a Lambda function and trigger.
Isolated environment. No antivirus code or potentially infected files would go anywhere near our Kubernetes cluster.
Scales with demand.

Unfortunately, the deal-breaker for this solution was the performance. From the start, we were aware there would be some delay while files were being scanned and had metrics in place to capture this. We aimed to get a working solution out ASAP and start tweaking the configurable options, like function memory, to optimise the performance.

What they don't tell you

clamscan loads the virus definitions into memory each time it performs a one-off scan.

In practice we found this would take a minimum of 20 seconds. Once the definitions had been loaded, the actual time to scan a file was in the milliseconds.

Running ClamAV as a daemon still requires this startup time, but subsequent scans are 🔥 blazing fast 🔥.

A delay of 20 seconds will fit some use-cases but this would have had an impact on transaction attachments and limited any other future work.

BucketAV

Another option was to use an off-the-shelf solution like BucketAV.

BucketAV uses ClamAV too so it should provide the same level of protection. We quickly weighed up the pros and cons of using this and decided it wasn't suitable for us, mostly due to the cost.

Cost: BucketAV has you run a dedicated EC2 instance which would be fairly costly for us as we run a very lean infrastructure currently. There's also a standard SaaS price for BucketAV which adds up.
Flexibility: Building our own solution enabled us to define exactly how it works. We can choose what files to scan, how to handle infected files, how to get alerted etc.

Conclusion

We went live with this feature at the end of May and our business owners have started to take advantage of PDF uploads.

956 PDFs uploaded as transaction evidence
6% of all transaction attachments are PDFs
15% of business owners that upload transaction evidence have uploaded at least one PDF

What about performance?

Here's a chart showing the scanning duration with a few different percentiles. You can see that half of all uploads take less than 10ms and almost all uploads are scanned within 25ms. Significantly faster than the 20 seconds taken using a Lambda Function.

What's next?

The original project focused on transaction attachments, but now that it's so simple to enable scanning on any bucket, we quickly followed up by allowing business owners to upload PDFs during sign-up. This is now the most popular option and greatly speeds up the sign-up process.

There's plenty of scope for allowing PDF uploads in other parts of the app and we could easily allow other file types in the future.