Introduction
Deploying LLMs in production requires containerizing the model server and orchestrating it for scale and resilience. This guide walks through packaging a simple inference server with Docker and deploying it on Kubernetes.
Docker Deployment
Dockerfile
# Slim base image keeps the container small; GPU access is provided by the
# host's NVIDIA container runtime, not baked into the image.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# Serve the FastAPI app defined in server.py with uvicorn.
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
Simple API Server
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load once at startup, not per request. The transformers-format checkpoint
# is the gated "-hf" repo (accept the license on the Hugging Face Hub first).
model_id = "meta-llama/Llama-2-7b-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

@app.post("/generate")
async def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # max_new_tokens bounds only the generated tokens, unlike max_length,
    # which also counts the prompt.
    outputs = model.generate(**inputs, max_new_tokens=100)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: llm-server
        image: llm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
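Note that the nvidia.com/gpu limit is only schedulable on nodes running the NVIDIA device plugin. To reach the pods from outside the cluster, a Service is also needed; a minimal sketch follows (the LoadBalancer type assumes a cloud provider that provisions one):

apiVersion: v1
kind: Service
metadata:
  name: llm-server
spec:
  type: LoadBalancer
  selector:
    app: llm-server
  ports:
  - port: 80
    targetPort: 8000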