
# Complete Guide to AI Model Deployment: From Local to Production

Published: March 6, 2026
Reading Time: 18 minutes
Word Count: 3,200 words


## Introduction

With the explosive growth of open-source AI models, deploying them efficiently has become a core challenge for developers. This article systematically walks through the complete AI model deployment process, covering local development, cloud deployment, edge computing, and other scenarios, to help you choose the deployment solution that best fits your needs.


## Deployment Options Overview

| Deployment Method | Use Case | Cost | Complexity | Latency |
|---|---|---|---|---|
| Local Deployment | Development/testing, privacy-sensitive | Low | Low | Very low |
| Cloud GPU | Production, high concurrency | Medium-High | Medium | Low |
| Serverless | Elastic needs, intermittent use | On-demand | Low | Medium |
| Edge Deployment | IoT, real-time applications | Medium | High | Very low |
| Hybrid Deployment | Complex business scenarios | Variable | High | Variable |

## I. Local Deployment Solutions

### 1.1 Ollama - Easiest Local Deployment

**Use Cases:** Personal development, rapid prototyping, privacy protection

#### Installation and Usage

```bash
# macOS/Linux installation
curl -fsSL https://ollama.com/install.sh | sh

# Windows download
# https://ollama.com/download/windows

# Pull and run a model
ollama pull llama3.2
ollama run llama3.2

# List installed models
ollama list

# Remove a model
ollama rm llama3.2
```

#### REST API Calls

```python
import requests
import json

# Generate text
response = requests.post('http://localhost:11434/api/generate', 
    json={
        'model': 'llama3.2',
        'prompt': 'Explain the basic principles of quantum computing',
        'stream': False
    })

result = response.json()
print(result['response'])

# Chat API
response = requests.post('http://localhost:11434/api/chat',
    json={
        'model': 'llama3.2',
        'messages': [
            {'role': 'user', 'content': 'Hello'}
        ],
        'stream': False
    })

print(response.json()['message']['content'])
```

#### Custom Models

```
# Create a Modelfile
FROM llama3.2

# System prompt
SYSTEM """You are a professional Python programming assistant, proficient in data analysis and machine learning."""

# Parameter settings
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# Add fine-tuning adapter (optional)
ADAPTER ./my-lora-adapter.gguf
```

```bash
# Create custom model
ollama create my-assistant -f Modelfile

# Run custom model
ollama run my-assistant
```

### 1.2 LM Studio - GUI Management Tool

**Use Cases:** Non-technical users, model comparison testing

#### Core Features

- **Model Browser:** One-click download of Hugging Face models
- **Chat Interface:** ChatGPT-like interactive experience
- **Local Server:** Provides an OpenAI-compatible API
- **Multi-model Management:** Manage multiple models simultaneously

#### Usage Workflow

1. Download and install LM Studio
2. Browse and download models (supports GGUF format)
3. Load a model and start chatting
4. Launch the local server (default port 1234)
5. Call it using the OpenAI SDK

#### API Call Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # LM Studio accepts any non-empty key
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)
```

---

### 1.3 llama.cpp - Extreme Performance Optimization

**Use Cases:** Resource-constrained environments, high-performance requirements

#### Build and Install

```bash
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build (CPU version)
make

# Build (CUDA version)
make LLAMA_CUDA=1

# Build (Metal version - macOS)
make LLAMA_METAL=1
```

#### Model Conversion and Quantization

```bash
# Convert Hugging Face model to GGUF
python convert_hf_to_gguf.py \
    --outfile model-f16.gguf \
    --outtype f16 \
    ./model-directory

# Quantize model (4-bit)
./llama-quantize model-f16.gguf model-q4_0.gguf q4_0

# Available quantization types:
# q4_0, q4_1, q5_0, q5_1, q8_0
# q2_k, q3_k_s, q3_k_m, q3_k_l
# q4_k_s, q4_k_m, q5_k_s, q5_k_m, q6_k
```

#### Running Inference

```bash
# Basic inference
./llama-cli \
    -m model-q4_0.gguf \
    -p "Explain basic machine learning concepts" \
    -n 256 \
    --temp 0.7

# Interactive mode
./llama-cli \
    -m model-q4_0.gguf \
    --interactive \
    --color

# Start server
./llama-server \
    -m model-q4_0.gguf \
    --host 0.0.0.0 \
    --port 8080
```

## II. Cloud Deployment Solutions

### 2.1 vLLM - Production-grade Inference Engine

**Use Cases:** High-concurrency services, production environments

#### Installation

```bash
# Basic installation
pip install vllm

# CUDA 12.1 wheels
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# Install from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

#### Basic Usage

```python
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    tensor_parallel_size=2,  # Use 2 GPUs
    gpu_memory_utilization=0.9
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=200
)

# Batch inference
prompts = [
    "Explain Python decorator principles",
    "What is deep learning?",
    "How to optimize SQL queries?"
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated_text}\n")
```

#### Launch API Service

```bash
# Start OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 \
    --port 8000

# Parameter explanation
# --model: Model name or path
# --tensor-parallel-size: GPU parallelism
# --pipeline-parallel-size: Pipeline parallelism
# --max-num-seqs: Maximum concurrent sequences
# --max-model-len: Maximum sequence length
```

#### Client Calls

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)
print(response.choices[0].message.content)

# Text completion
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B",
    prompt="Explain neural networks"
)
print(response.choices[0].text)
```

### 2.2 Text Generation Inference (TGI)

**Use Cases:** Hugging Face ecosystem, enterprise deployment

#### Docker Deployment

```bash
# Run TGI container
model=meta-llama/Llama-3.1-8B
volume=$PWD/data

docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model \
    --num-shard 2
```

#### Python Client

```python
from text_generation import Client

client = Client("http://localhost:8080")

# Text generation
text = client.generate(
    "Explain machine learning",
    max_new_tokens=200,
    temperature=0.7
).generated_text

# Streaming generation
for response in client.generate_stream(
    "Explain deep learning",
    max_new_tokens=200
):
    print(response.token.text, end="")
```

### 2.3 Cloud Platform Deployment

#### AWS SageMaker

```python
import boto3
from sagemaker.huggingface import HuggingFaceModel

# Configure role and session
role = "arn:aws:iam::account-id:role/SageMakerRole"
sess = boto3.Session()

# Create Hugging Face model
huggingface_model = HuggingFaceModel(
    model_data='s3://my-bucket/model.tar.gz',
    role=role,
    transformers_version='4.28',
    pytorch_version='2.0',
    py_version='py310'
)

# Deploy endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge'
)

# Inference
result = predictor.predict({
    'inputs': 'Explain artificial intelligence'
})
```

#### Google Cloud Vertex AI

```python
from google.cloud import aiplatform

# Initialize
aiplatform.init(project='my-project', location='us-central1')

# Deploy model
model = aiplatform.Model.upload(
    display_name='llama-3-2',
    artifact_uri='gs://my-bucket/model',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-0:latest'
)

# Create endpoint
endpoint = model.deploy(
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1
)
```

## III. Containerization and Orchestration

### 3.1 Docker Deployment

#### Dockerfile Example

```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    python3-pip \
    python3-dev \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip3 install --no-cache-dir \
    vllm \
    transformers \
    accelerate

# Copy application code
COPY . /app
WORKDIR /app

# Expose port
EXPOSE 8000

# Start command
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "meta-llama/Llama-3.1-8B", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
```

#### Build and Run

```bash
# Build image
docker build -t my-llm-server .

# Run container
docker run --gpus all \
    -p 8000:8000 \
    --shm-size=16gb \
    my-llm-server

# Use docker-compose
docker-compose up -d
```
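The `docker-compose up -d` command above assumes a compose file. A minimal hypothetical `docker-compose.yml` for the image built here might look like this (the `my-llm-server` image name matches the build step above; the `deploy.resources` GPU reservation syntax requires a recent Docker Compose with the NVIDIA container toolkit installed):

```yaml
services:
  llm-server:
    image: my-llm-server
    ports:
      - "8000:8000"
    shm_size: "16gb"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```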
### 3.2 Kubernetes Deployment

#### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: llm
        image: my-llm-server:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-server
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
```

#### Deployment Commands

```bash
# Apply configuration
kubectl apply -f deployment.yaml

# Check status
kubectl get pods
kubectl get svc

# Scale
kubectl scale deployment llm-deployment --replicas=3
```
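Manual scaling can also be automated with a HorizontalPodAutoscaler. A sketch under the assumption that the Deployment above is used as-is (`llm-deployment` matches its name; CPU-based scaling is shown for simplicity, though GPU inference services often scale on custom metrics such as queue depth instead):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-deployment
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```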
## IV. Performance Optimization

### 4.1 Quantization Techniques

| Quantization Type | Precision Loss | VRAM Savings | Speed Improvement |
|---|---|---|---|
| FP16 | Minimal | 50% | 1.5-2x |
| INT8 | Small | 75% | 2-3x |
| GPTQ-4bit | Medium | 75% | 3-4x |
| AWQ-4bit | Medium | 75% | 3-4x |
| GGUF-Q4 | Medium | 75% | 2-3x |

#### AWQ Quantization Example

```python

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B"
quant_path = "llama-3.1-8b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

---

### 4.2 Inference Optimization

#### Batch Processing Optimization

```python
from vllm import LLM, SamplingParams

# Dynamic batching
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    max_num_seqs=256,  # Maximum concurrent requests
    max_num_batched_tokens=4096  # Maximum tokens per batch
)

# Continuous batching: vLLM handles this automatically,
# no additional configuration needed.
```

#### Cache Optimization

```python
from vllm import LLM

# KV cache optimization
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    gpu_memory_utilization=0.95,  # GPU memory utilization
    swap_space=4,  # CPU swap space (GB)
    enforce_eager=False  # Enable CUDA graph optimization
)
```

---

## V. Monitoring and Operations

### 5.1 Performance Monitoring

#### Prometheus + Grafana

```python
from prometheus_client import Counter, Histogram, start_http_server
import time

# Define metrics
request_count = Counter('llm_requests_total', 'Total requests')
request_duration = Histogram('llm_request_duration_seconds', 'Request duration')
tokens_generated = Counter('llm_tokens_generated_total', 'Tokens generated')

# Start metrics server
start_http_server(9090)

# Record in inference code
@request_duration.time()
def generate_text(prompt):
    request_count.inc()
    result = model.generate(prompt)
    tokens_generated.inc(len(result.tokens))
    return result
```

### 5.2 Log Management

```python
import logging
import json
import time

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('llm-server')

# Structured logging
def log_request(request_id, prompt, response, latency):
    logger.info(json.dumps({
        'request_id': request_id,
        'prompt_length': len(prompt),
        'response_length': len(response),
        'latency_ms': latency,
        'timestamp': time.time()
    }))
```

---

## VI. Security and Compliance

### 6.1 Access Control

```python
from fastapi import FastAPI, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

app = FastAPI()
security = HTTPBearer()

API_KEYS = {"sk-abc123", "sk-def456"}

async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if credentials.credentials not in API_KEYS:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return credentials.credentials

@app.post("/v1/chat/completions")
async def chat_completion(api_key: str = Security(verify_token)):
    # Process request
    pass
```

### 6.2 Content Filtering

```python

import re
from openai import OpenAI

client = OpenAI()  # used for the moderation call below

# Input filtering
FORBIDDEN_PATTERNS = [
    r'(?i)(hack|exploit|attack)',        # Security related
    r'(?i)(credit.?card|ssn|password)',  # Sensitive information
]

def filter_input(text):
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, text):
            raise ValueError("Input contains forbidden content")
    return text

# Output filtering
def filter_output(text):
    # Use a content moderation API
    moderation_result = client.moderations.create(input=text)
    if moderation_result.results[0].flagged:
        raise ValueError("Output flagged by moderation")
    return text
```


## VII. Best Practices Summary

### 7.1 Deployment Checklist

- [ ] Model loaded correctly
- [ ] API interface tested
- [ ] Performance benchmark completed
- [ ] Monitoring and alerting configured
- [ ] Log collection configured
- [ ] Security policies implemented
- [ ] Backup and recovery plan ready
- [ ] Scaling plan prepared

### 7.2 Cost Optimization Recommendations

1. **Auto-scaling:** Use K8s HPA or cloud vendor auto-scaling
2. **Model Quantization:** Use 4-bit quantization to reduce VRAM usage
3. **Batch Processing:** Set an appropriate batch size to improve throughput
4. **Caching Strategy:** Cache common requests
5. **Hot/Warm Separation:** Use serverless for low-frequency models

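For the caching strategy, a minimal in-process cache can be sketched with `functools.lru_cache`; the `_call_model` function here is a hypothetical stand-in for a real inference call, not part of any library:

```python
from functools import lru_cache

def _call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real inference call
    # (e.g. a request to an OpenAI-compatible endpoint).
    return f"response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    # Repeated identical prompts are served from memory
    # instead of re-running inference.
    return _call_model(prompt)
```

In production, a shared external cache such as Redis, keyed on a hash of the normalized prompt plus sampling parameters, is the more common choice, since it survives restarts and is shared across replicas.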
### 7.3 Troubleshooting

| Issue | Possible Cause | Solution |
|---|---|---|
| OOM error | Insufficient VRAM | Reduce batch size; use quantized models |
| Slow response | High concurrency | Add instances; optimize batching |
| Model load failure | Path error | Check model path and permissions |
| API unresponsive | Port conflict | Check port usage |

## Conclusion

AI model deployment is a complex engineering task involving multiple technology stacks. Choosing the right deployment solution requires comprehensive consideration of performance, cost, and complexity.

**Recommended Path:**

1. **Development phase:** Use Ollama or LM Studio for local testing
2. **Testing phase:** Use a single-machine vLLM deployment
3. **Production phase:** Use a K8s + vLLM cluster deployment
4. **Large-scale scenarios:** Consider dedicated inference services or cloud vendor solutions

As the technology evolves rapidly, deployment solutions are constantly improving. It is worth following the latest developments in projects such as vLLM and TGI and adopting new optimizations as they mature.


**Keywords:** AI Model Deployment, vLLM, Ollama, llama.cpp, Model Quantization, Cloud Deployment, Kubernetes, Inference Optimization

**Related Reading:**

- Top 10 AI Open Source Projects 2026
- AI Agent Development Guide
- Python Crawler Tutorial

Last Updated: March 6, 2026