Serverless Deployment of Deep Learning Models

Host your ML models for a few cents a month.

Posted by Alexander Meinke on March 16, 2022 · 12 mins read


When taking a deep learning project from theory to practice at some point one has to decide how and where to deploy the model. Buying or renting the necessary server infrastructure can quickly get very pricy, especially when the model is only called at infrequent and irregular times.

Luckily, in the cloud it is possible to pay only precisely for the resources that we need, when we need them. To demonstrate how this can be done, I decided to host an API that takes in an image and uses deep learning to generate a description of the image. Try it out below!


In this post, I will walk you through the design considerations that go into building such a solution. If you just want to implement it yourself, feel free to skip straight to the code on GitHub.

Serverless Hosting

The obvious way of hosting a machine learning model on AWS is to provision an EC2 instance and set up a REST API that gets served through some framework like Flask, which calls the machine learning model in the backend. The advantage of this approach is that it is quite easy to develop and deploy. With additional infrastructure like Auto Scaling and Elastic Load Balancers this can even be made to scale to a large number of requests.
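To make the EC2-style setup concrete, here is a minimal Flask sketch. The route name `/predict` and the `generate_caption` stub are illustrative assumptions, not the actual model code from this post.

```python
# Minimal sketch of the EC2-style deployment: a Flask app that accepts an
# uploaded image and returns a caption. generate_caption is a stand-in for
# the real deep learning model.
import io

from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)


def generate_caption(image):
    # Placeholder for the actual model inference.
    return "a placeholder caption"


@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded file from the multipart form and decode it.
    image = Image.open(io.BytesIO(request.files["image"].read()))
    return jsonify({"caption": generate_caption(image)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

On a plain EC2 instance this process would have to run (and be paid for) around the clock, which is exactly the cost problem discussed next.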

So what's the catch? The price tag. If our hosted model only receives requests at infrequent and unpredictable times, then we may not want to leave our EC2 instance running 24/7. And if our traffic comes in sudden bursts of unexpected parallel requests, we would even have to keep a powerful (and expensive) instance type running at all times. Even if we set up auto-scaling, it will be too slow at booting up new machines (on the order of minutes), so it could miss the spike in traffic entirely. On top of all of this, we need to manage the instances ourselves by keeping the OS updated and ensuring our server is not exposing any security vulnerabilities to the world. Luckily, there is a solution: serverless deployment using AWS Lambda!

Of course, serverlessly deploying something does not mean that no servers are involved. It simply means that we don't have to manage them ourselves. No taking care of operating system patches or security groups or anything like that. We simply upload our code and we only pay for the time that the code actually spends running. If we suddenly need to serve multiple requests simultaneously, then AWS automatically handles the scaling quickly and quietly. Making the Lambda function available to the internet can be achieved quite easily via another serverless AWS service: API Gateway.
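In code, the move from a Flask server to Lambda is small: instead of running an app, we expose a handler function. The sketch below follows the API Gateway proxy integration's event and response shapes; the captioning logic itself is stubbed out.

```python
# Sketch of a Lambda handler behind API Gateway. With the proxy
# integration, API Gateway delivers binary request bodies base64-encoded
# and expects a dict with statusCode/headers/body in return.
import base64
import json


def generate_caption(image_bytes):
    # Placeholder for the real model inference.
    return "a placeholder caption"


def handler(event, context):
    # Decode the base64-encoded image from the request body.
    image_bytes = base64.b64decode(event["body"])
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"caption": generate_caption(image_bytes)}),
    }
```

AWS bills only for the milliseconds this function actually runs, which is the whole appeal for spiky, low-volume traffic.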

Trouble in Paradise

Unfortunately, AWS Lambda comes with a few caveats. First of all, we will not have access to GPU-acceleration so we shouldn't try to host huge models and expect millisecond latency. Secondly, the size of the deployment packages for AWS Lambda is limited to 50MB for the zipped deployment package and 250MB unzipped, including Lambda Layers (Lambda Layers are a way to package libraries and dependencies). Even if our model is smaller than 250MB, we will not be able to fit all of our dependencies within these limits.

A simple way to offload storage away from Lambda would be to host our model weights on S3 and then download them into memory as soon as the Lambda function is invoked. Unfortunately, this would mean downloading the same file at the start of every function call, which would lead to unnecessary latency. What we can do instead is store our model on Amazon Elastic File System (EFS) and allow the Lambda function to read data from there. You can find excellent step-by-step guides on how to host ML models this way here and here.

[Figure] API Gateway calls the Lambda function, which loads the model and dependencies from EFS.

This solution, however, is quite complex. Simply pushing a new model to the EFS is already cumbersome because mounting the EFS can only be done either through an EC2 instance inside the same VPC or a VPN, both of which add cost and complexity. Additionally, any code changes to the model have to occur both in the Lambda function as well as the pickled model stored on the EFS. Avoiding downtime in this case would require an inconvenient amount of DevOps duct tape.

Luckily, AWS provides another, much more elegant solution to the same problem: we can simply run a Lambda function straight from a Docker container. The setup is also simple and has excellent documentation. Basically, all we have to do is package our Lambda function in a Docker image and push this image into Elastic Container Registry (ECR).

[Figure] API Gateway calls the Lambda function, which loads a Docker container from ECR.

Cost, Latency, Complexity

Given these two approaches, which one is better? In terms of complexity, the Docker-based solution is clearly more elegant and easier to maintain. But what about latency? What about cost? I am going to assume pretty low traffic, because that is precisely the scenario where serverless has a leg up over simple EC2 instances. Let's assume maybe up to 10,000 requests per month. So how expensive is a single request? Of course, that depends on the model. I decided to simply host a pre-trained model, since my main interest was in the deployment aspect.

When running our model, the Lambda function tends to use around 700-800MB of RAM. That means we should select the next larger increment of 1024MB when configuring our Lambda function. Now, when calling either the ECR- or the EFS-hosted solution for the first time, it has to cold-start, so latency will be quite bad. I measured an average latency of 27.3s for EFS and 32.2s for Docker when running cold-starts. Luckily, after a cold-start AWS keeps the loaded code in a ready state, which means the next invocations will be much faster for some non-deterministic amount of time. When running from a warm-start, the average latency is basically identical: 4.7s for EFS and 4.8s for Docker.

However, I have noticed the following: for the Docker-based solution there is a third possibility, a luke-warm-start, if you will. The average latency for those was around 10.8s. When testing both setups at random times over several days, I could see that often when the EFS setup had cooled all the way down to a cold-start, the Docker-based function was usually still in the luke-warm state that allows for much faster bootups. While the details will heavily depend on the precise traffic patterns, my impression is that the Docker-based setup is preferable as long as one can manage to keep it "luke-warm" most of the time.

In terms of cost, let's just assume that both solutions have basically the same average runtime per invocation. Again, this depends a lot on traffic patterns, but I will just go with an even 10 seconds. How much will we pay for each solution? Based on the prices for the region eu-central-1, the 10,000 Lambda invocations would come out to around $1.67 a month. For both the ECR and the EFS solution, the storage that we need, including all dependencies, is just over 2GB. Luckily, transferring data between ECR, EFS and Lambda within the same region is completely free, so we only pay for the storage. Assuming One Zone availability is enough for our EFS, the cost comes out to around $0.40 per month for EFS and around $0.20 a month for ECR. At less than $0.04 per month, the API Gateway would cost us a negligible amount.
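As a back-of-envelope check on these numbers, the compute cost follows directly from Lambda's GB-second pricing. The rates below are the eu-central-1 prices as of writing ($0.0000166667 per GB-second of compute, $0.20 per million requests); check the current pricing page before relying on them.

```python
# Back-of-envelope Lambda cost estimate for the assumed workload:
# 10,000 invocations/month, 10s each, at a 1024MB memory configuration.
PRICE_PER_GB_SECOND = 0.0000166667  # eu-central-1, as of writing
PRICE_PER_MILLION_REQUESTS = 0.20

invocations = 10_000
seconds_per_invocation = 10
memory_gb = 1024 / 1024  # 1024MB configuration = 1 GB

gb_seconds = invocations * seconds_per_invocation * memory_gb
compute_cost = gb_seconds * PRICE_PER_GB_SECOND
request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS

monthly_cost = compute_cost + request_cost
print(f"${monthly_cost:.2f} per month")  # prints "$1.67 per month"
```

Note that this ignores the Lambda free tier, which would cover a workload this small entirely in practice.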


Ultimately, since the latency and cost of both approaches are quite similar, I would certainly recommend the Docker-based solution, simply because it is far easier to set up and maintain. The fact that the luke-warm-starts can sometimes improve latencies is just a nice bonus. If you are interested in setting up a solution like this as well, check out my GitHub repo, where I show step-by-step how to host this same model using ECR.

If we wanted to improve our solution and push down both the latencies and the cost, I think it would be most effective to remove dependencies from our code. Right now we are using torch, numpy, Pillow and torchvision, which is definitely overkill. Cutting out dependencies would reduce the size of our deployment package and make our application both cheaper and faster.
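As one illustration of what cutting a dependency could look like, the torchvision-style preprocessing can be reimplemented with Pillow alone. This is only a sketch: the 224x224 input size and the ImageNet mean/std values below are common defaults and may not match the actual hosted model's pipeline.

```python
# Sketch: torchvision-free image preprocessing using only Pillow.
# Resizes, scales to [0, 1], normalizes per channel, and returns nested
# lists in [channel][row][col] order. Size and mean/std are assumptions.
from PIL import Image

MEAN = (0.485, 0.456, 0.406)  # standard ImageNet statistics
STD = (0.229, 0.224, 0.225)


def preprocess(image, size=224):
    image = image.convert("RGB").resize((size, size), Image.BILINEAR)
    pixels = list(image.getdata())  # flat list of (r, g, b) tuples
    channels = []
    for c in range(3):
        # Scale each channel to [0, 1] and normalize.
        channel = [(p[c] / 255.0 - MEAN[c]) / STD[c] for p in pixels]
        # Reshape the flat list into rows.
        channels.append([channel[i * size:(i + 1) * size] for i in range(size)])
    return channels
```

Whether dropping numpy and torchvision this way is worth the extra code is a judgment call, but it directly shrinks the deployment package that Lambda has to pull on every cold-start.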