Deploying PyTorch to Production

Stefan Libiseller
December 12th, 2019 · 2 min read

When I started with data science it was pretty hard to find material on how to deploy the models I had just trained. It seemed like everyone was busy blogging about fancy new optimizers or the latest model architecture. Sure, deploying machine learning models is one of the less fancy sides of data science, but in my opinion equally important. After all, what good is the best machine learning model if you can't use it?

In this post I'll show how to build an easy, self-hosted web API for small to medium volume production using Flask. For more power and auto-scalability I recommend Cortex, a platform that deploys models to AWS with very little effort. I intentionally won't cover deployment using TorchScript, as that deserves a blog post of its own.

Flask Web API

Python frameworks like Flask make it easy to create simple web APIs. While this is not the most performant method to deploy PyTorch models, it is ideal for small to medium volume production. It's quick to adapt and deploy, which is often more important than performance.

I highly recommend using virtual environments like conda to keep the dependencies of different projects separate. For this API you'll need to install Flask and Gunicorn into your environment:

conda install flask gunicorn

The code below is the whole application; you just need to insert your model and fill in the predict function. Also, don't forget to set your model to evaluation mode with model.eval() and to run inference under torch.no_grad(), since there is no need to compute gradients at inference time!

# Stefan Libiseller

from flask import Flask, json, request

app = Flask(__name__)
api_endpoint = "/my_endpoint"

# load your model here


def predict(message):
    # do your pytorch magic here
    # return as dict
    return {'class1': 0.3, 'class2': 0.7}


@app.route(api_endpoint, methods=['GET'], endpoint=api_endpoint)
def api():
    # read the JSON body and extract the "message" field
    req_data = request.get_json()
    message = req_data['message']

    try:
        response, status = predict(message), 200
    except Exception as e:
        # any exception raised in predict() is returned as an error response
        response, status = {"error": str(e)}, 500

    return app.response_class(
        response=json.dumps(response),
        status=status,
        mimetype='application/json'
    )


if __name__ == '__main__':
    # start Flask's built-in development server
    app.run(host='', port=5000)

Essentially, this code just grabs the message field from the request, passes it through the predict function and returns the response.

Tip: You can raise exceptions in the predict function and the API will return an error with the exception message. This lets you implement assertions or see exactly what piece of code failed.
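
As a rough illustration of this tip, the sketch below validates the input inside predict and uses the same try/except pattern as the API function. The specific checks, the handle helper and the fixed scores are assumptions made for demonstration; they are not part of the application above.

```python
import json


def predict(message):
    # Hypothetical input checks; adapt them to what your model expects.
    if not isinstance(message, str):
        raise TypeError("message must be a string")
    if not message.strip():
        raise ValueError("message must not be empty")
    # Placeholder scores standing in for real model output.
    return {'class1': 0.3, 'class2': 0.7}


def handle(message):
    # Same pattern as the API function: any exception becomes a 500 response.
    try:
        response, status = predict(message), 200
    except Exception as e:
        response, status = {"error": str(e)}, 500
    return json.dumps(response), status


print(handle("literally anything"))  # ('{"class1": 0.3, "class2": 0.7}', 200)
print(handle(""))                    # ('{"error": "message must not be empty"}', 500)
```

Because the exception message ends up in the response body, a failed check tells the caller exactly which condition was violated.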


You can test it by executing the file, which should give you an output similar to this...

 * Serving Flask app "api" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on (Press CTRL+C to quit)

Don't be confused by the warning; we'll fix it in a minute. Since we are only testing our application code here, Flask is correctly warning us that its built-in development server is not meant for production.

To submit test requests to the API I recommend using Postman. You can import my preconfigured request by copying and pasting this link:

Requests need to have a Content-Type: application/json header and a JSON body with a message field like this:

{
  "message": "literally anything"
}
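
If you would rather test from a script than from Postman, a request of this shape can be built with Python's standard library. This is only a sketch: the URL assumes the development server is running locally on port 5000 with the /my_endpoint route, and build_request and send are hypothetical helper names.

```python
import json
import urllib.request

# Assumed address: local development server, port 5000, endpoint "/my_endpoint".
API_URL = "http://localhost:5000/my_endpoint"


def build_request(message, url=API_URL):
    # Encode the message as a JSON body and set the JSON content type.
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",  # matches methods=['GET'] in the route
    )


def send(message):
    # Requires the API to be running; urlopen raises HTTPError on a 500 response.
    with urllib.request.urlopen(build_request(message)) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))
```

With the server running, send("literally anything") returns the status code together with the decoded prediction.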


To start the application in production use this command:

gunicorn api:app

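Gunicorn is a WSGI server built on a pre-fork worker model, which essentially means it can run several worker processes and handle multiple requests at once (the number of workers is set with the -w flag). So make sure to use it when deploying.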
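
Gunicorn settings can also be kept in a Python config file instead of command line flags. The file below is a minimal sketch: bind, workers and timeout are real Gunicorn settings, but the values are assumptions that you should tune for your own server and model.

```python
# gunicorn.conf.py
import multiprocessing

# Address and port to listen on (this value is also Gunicorn's default).
bind = "127.0.0.1:8000"

# Number of worker processes; capped at 4 here as an arbitrary assumption.
workers = min(4, multiprocessing.cpu_count())

# Seconds before an unresponsive worker is restarted; raised from the
# default of 30 on the assumption that inference can be slow.
timeout = 60
```

Start the server with gunicorn -c gunicorn.conf.py api:app to use these settings.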

Be aware that big models are also going to consume a lot of RAM. Make sure to check that you have enough available before deploying! As a rule of thumb, you need at least as much RAM as the size of the loaded weights.
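
For a rough lower bound, you can estimate the weight size from the parameter count. The sketch below assumes float32 weights (4 bytes per parameter) and that every Gunicorn worker holds its own copy of the model; the function names and the 100 million parameter model are hypothetical. Actual usage will be higher because of activations and the Python runtime.

```python
def weights_size_mb(num_parameters, bytes_per_parameter=4):
    # Size of the raw weights in MiB (float32 = 4 bytes per parameter).
    return num_parameters * bytes_per_parameter / 1024 ** 2


def total_ram_mb(num_parameters, workers=1, bytes_per_parameter=4):
    # Assumes each worker process loads its own copy of the weights.
    return workers * weights_size_mb(num_parameters, bytes_per_parameter)


# Hypothetical model with 100 million parameters.
print(round(weights_size_mb(100_000_000)))          # 381
print(round(total_ram_mb(100_000_000, workers=4)))  # 1526
```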

Cortex Auto Scale AWS Deployment

If you don't have your own server or need something that scales with demand, I recommend Cortex. It's an open-source platform that essentially does the same as the API above, but on auto-scaling AWS infrastructure that can also use GPU acceleration.

They have great documentation and tutorials on their website. I also suggest taking a look at one of their examples; it helped me a lot to understand how it works in practice.


These are just two examples of how PyTorch models can be deployed to production with little to no pain. The Flask API is a great start, and Cortex can take it to the next level if your application requires it. What we didn't cover are optimization techniques that reduce model size, such as quantisation and pruning close-to-zero weights, as they only become necessary if you want to deploy to resource-restricted environments like smartphones.

Happy shipping!

Other Links:
ONNX - Open neural network exchange format
