This article is a guide to deploying a containerized Llama 3.2 LLM on GKE (Google Kubernetes Engine). We will deploy a service running the alpine/llama3.2:latest container, expose it to the internet, and demonstrate communication with it using the curl command.
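As a preview of where we are headed, the deployment boils down to a Kubernetes manifest along the following lines. This is a minimal sketch, not the final manifest from the walkthrough: the resource names are placeholders, and the container port assumes the image exposes an Ollama-style HTTP API on port 11434 (verify against the image's documentation).

```yaml
# Minimal sketch of a Deployment plus LoadBalancer Service for the
# alpine/llama3.2 container. Names and ports are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama32                    # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama32
  template:
    metadata:
      labels:
        app: llama32
    spec:
      containers:
        - name: llama32
          image: alpine/llama3.2:latest
          ports:
            - containerPort: 11434   # assumed Ollama-style API port
---
apiVersion: v1
kind: Service
metadata:
  name: llama32-svc                # placeholder name
spec:
  type: LoadBalancer               # gives the pod a public IP on GKE
  selector:
    app: llama32
  ports:
    - port: 80                     # external port
      targetPort: 11434            # forwards to the container's API port
```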

Why should you care?

Deploying an LLM locally, without depending on third-party APIs, has many benefits worth considering:

  1. Protect intellectual property (IP) and privacy. There have been cases where IP and code were used by a third party to the detriment of the client [5], [6], [7].
  2. Run without an internet connection: the LLM executes locally and is reached through a local API.
  3. Avoid the high costs of renting expensive GPUs.
  4. Apply security measures granularly, tailored to your specific needs.
  5. Avoid paying an external party for each use.

What is Llama 3.2?

Llama 3.2 represents Meta's latest advancement in large language models, offering a diverse range of capabilities to suit various AI applications. The collection comes in four sizes: 1 billion, 3 billion, 11 billion, and 90 billion parameters. The smaller 1B and 3B models are text-only and designed for efficient operation on edge devices and mobile applications, enabling local processing without relying on cloud infrastructure. In contrast, the larger 11B and 90B models are multimodal, capable of processing both text and images, marking the first time Llama models have incorporated vision capabilities.

All Llama 3.2 models have a 128K-token context length, allowing them to handle extensive input sequences. They also offer improved multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. With this range of sizes and capabilities, Llama 3.2 is poised to power a wide array of AI applications, from content creation and conversational AI to visual reasoning and code generation.

Why Llama 3.2?

  1. GPU-independent
  2. Open source and free
  3. Can be deployed locally
  4. Generates code
  5. Multilingual
  6. Available in a containerized version (see the sketch after this list)
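To illustrate the last point, here is how you might pull and talk to the container before ever touching GKE. This is a sketch under the assumption that the alpine/llama3.2 image wraps an Ollama-style server listening on port 11434 with a /api/generate endpoint; check the image's documentation for the exact port and API.

```sh
# Start the container locally (assumed port 11434; adjust if the image differs).
docker run -d --name llama32 -p 11434:11434 alpine/llama3.2:latest

# Query it with curl; the payload follows Ollama's generate-API convention.
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Say hello in German.", "stream": false}'
```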