Deploying LLMs with Docker and Kubernetes

Introduction

Deploying LLMs in production requires packaging the model server into a container image and then orchestrating replicas, GPU scheduling, and rollouts. This guide walks through a minimal Docker image for an LLM inference server and a Kubernetes Deployment that runs it.

Docker Deployment

Dockerfile

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "server.py"]
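The Dockerfile copies a requirements.txt that is not shown above. A minimal sketch for the server in this guide might look like the following (the exact pins are assumptions; adjust to the versions you have tested):

```text
fastapi
uvicorn[standard]
torch
transformers
accelerate
```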

Simple API Server

from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load once at startup. The transformers-compatible (gated) repo id on the
# Hugging Face Hub is "meta-llama/Llama-2-7b-hf", not "meta-llama/Llama-2-7b".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

@app.post("/generate")
async def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # max_new_tokens bounds the generated continuation; max_length would
        # also count the prompt tokens.
        outputs = model.generate(**inputs, max_new_tokens=100)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
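With server.py and the Dockerfile in place, the image can be built and exercised locally. These commands are a sketch assuming a GPU host with the NVIDIA Container Toolkit installed and the image tag llm-server:latest:

```shell
# Build the image from the directory containing the Dockerfile
docker build -t llm-server:latest .

# Run it, exposing the FastAPI port and passing the GPU through
docker run --gpus all -p 8000:8000 llm-server:latest

# Exercise the endpoint; a bare `prompt: str` parameter on a FastAPI POST
# handler is read from the query string, not the request body
curl -X POST "http://localhost:8000/generate?prompt=Hello"
```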

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: llm-server
        image: llm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
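The Deployment keeps three replicas running, but clients still need a stable address in front of them. A matching Service could look like the following sketch (the name and type are assumptions; the selector matches the Deployment's labels):

```text
apiVersion: v1
kind: Service
metadata:
  name: llm-server
spec:
  selector:
    app: llm-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
```

Apply both manifests with kubectl apply -f, then reach the pods inside the cluster at http://llm-server.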

Source: JackAI Hub