Introduction
Deploying LLMs in production requires containerizing the model server and orchestrating it for scale and resilience. This guide walks through packaging a simple inference server with Docker and deploying it on Kubernetes.
Docker Deployment
Dockerfile
# Slim base image keeps the container small; GPU access is provided by the
# host's NVIDIA container runtime, not baked into the image.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# Serve the FastAPI app defined in server.py with uvicorn.
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
Simple API Server
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load once at startup, not per request. The transformers-format checkpoint
# is the gated "-hf" repo (accept the license on the Hugging Face Hub first).
model_id = "meta-llama/Llama-2-7b-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

@app.post("/generate")
async def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # max_new_tokens bounds only the generated tokens, unlike max_length,
    # which also counts the prompt.
    outputs = model.generate(**inputs, max_new_tokens=100)
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: llm-server
        image: llm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
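Note that the nvidia.com/gpu limit is only schedulable on nodes running the NVIDIA device plugin. To reach the pods from outside the cluster, a Service is also needed; a minimal sketch follows (the LoadBalancer type assumes a cloud provider that provisions one):

apiVersion: v1
kind: Service
metadata:
  name: llm-server
spec:
  type: LoadBalancer
  selector:
    app: llm-server
  ports:
  - port: 80
    targetPort: 8000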