Abstract:

This article addresses the challenges of containerizing and deploying a Large Language Model (LLM). The main challenges include: the substantial disk space required by these models, the time-consuming process of pushing and pulling them, the lengthy build times, and the need to select a model with a permissive license for use and deployment in production without legal restrictions.

https://github.com/bitsector/containerised-hf-zephyr-7b-beta

Why should you care?

Deploying an LLM locally, without depending on third-party APIs, has several benefits worth considering:

  1. Protection of intellectual property (IP) and privacy. There have been cases where IP and code were used by the third party to the detriment of the client [5], [6], [7].
  2. Offline operation - no internet connection is required; the LLM runs locally and is accessed through a local API.
  3. Avoid high costs of renting expensive GPUs.
  4. Apply security measures granularly tailored to your specific needs.
  5. Avoid paying an external party for each use.

Why I chose the Zephyr-7b-beta model:

  1. It's large - the "7b" stands for 7 billion parameters
  2. It's a text generation model - enabling chat demonstrations
  3. It's released under the MIT License - one of the least restrictive licenses available

I will outline the steps to build, deploy, test, and expose the model's chat capability.

Containerizing the app: The challenge

Suppose we want to use Zephyr-7b-beta as a chatbot. We wrap it with a simple Python Flask web app, install pip dependencies, package everything into a container, build, push to Docker Hub, and deploy as a web app. What could possibly go wrong?
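To make this concrete, a wrapper along these lines might look as follows. This is a minimal sketch: the route name (/chat), port, and generation parameters are illustrative choices, not necessarily what the linked repository uses.

```python
# app.py - minimal sketch of wrapping zephyr-7b-beta in a Flask chat endpoint.
import torch
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Loading the pipeline downloads roughly 14 GB of model weights on first run.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # use a GPU if available, otherwise fall back to CPU
)

@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.get_json(force=True).get("prompt", "")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]
    # Zephyr is chat-tuned, so format the conversation with its chat template.
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
    # The pipeline returns the prompt plus the completion; strip the prompt.
    return jsonify({"response": outputs[0]["generated_text"][len(prompt):]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile would then only need to copy this app, pip-install flask, torch, and transformers, and start the Flask server - and that is exactly where the size and build-time problems show up.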