AI Infrastructure Operations: A Practical Guide

1. AI Infrastructure Overview

┌────────────────────────────────────────────────────────────┐
│                    User / Business Layer                   │
│  Chat UI / API Gateway / Internal enterprise apps          │
├────────────────────────────────────────────────────────────┤
│                    Inference Gateway Layer                 │
│  LiteLLM / vLLM Proxy / Nginx → routing/rate limiting/auth │
├────────────────────────────────────────────────────────────┤
│                    Model Serving Layer                     │
│  vLLM | TGI | Triton | Ollama  (inference)                 │
│  Kubeflow PyTorchJob/TFJob     (training)                  │
├────────────────────────────────────────────────────────────┤
│                    Storage Layer                           │
│  Model registry (S3/MinIO) | Vector DB (Milvus/Qdrant)     │
│  Training data (PVC/Ceph) | Checkpoints (object storage)   │
├────────────────────────────────────────────────────────────┤
│                    K8s Scheduling Layer (GPU Operator)     │
│  GPU Device Plugin | MIG | Time-slicing | Topology         │
├────────────────────────────────────────────────────────────┤
│                    Hardware Layer                          │
│  NVIDIA A100/H100/L40S cluster | InfiniBand/RoCE network   │
└────────────────────────────────────────────────────────────┘

As an operations engineer, you are responsible for every layer, from the hardware (GPU health) up to the service (inference latency).

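2. GPU Cluster Planning and Selection

2.1 GPU Model Comparison

GPU	Memory	Memory bandwidth	FP16 Tensor TFLOPS (dense)	NVLink	Best suited for	Cloud reference price (per hour)
A100-40GB	40GB	1555 GB/s	312	600 GB/s	Training + inference	~$1.5
A100-80GB	80GB	2039 GB/s	312	600 GB/s	Large-model training	~$2.5
H100-80GB	80GB	3350 GB/s	989	900 GB/s	Latest-generation training	~$4.0
L40S	48GB	864 GB/s	362	None	Inference + fine-tuning	~$1.0
A10	24GB	600 GB/s	125	None	Small-model inference	~$0.6
T4	16GB	320 GB/s	65	None	Lightweight inference	~$0.4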

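2.2 Sizing Formula for Inference

Required GPU memory ≈ parameter count × bytes per parameter × 1.2 (the factor 1.2 is a rough allowance for KV cache and runtime overhead; long contexts or high concurrency need more)

Examples:
- Qwen2.5-7B FP16:  7B × 2 bytes × 1.2 ≈ 16.8GB → a single A10 or A100 40GB
- Qwen2.5-72B FP16: 72B × 2 bytes × 1.2 ≈ 173GB → 4 × A100 80GB (tensor parallel)
- Llama-3-70B INT4: 70B × 0.5 bytes × 1.2 ≈ 42GB → a single L40S (48GB) or A100 80GB; this does not fit in an A100 40GB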
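The sizing rule above is easy to script when comparing several models. The following is a minimal sketch: the function names are illustrative, and rounding the GPU count up to a power of two is an assumption made here because tensor-parallel sizes normally have to divide the model's attention-head count evenly.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough serving-memory estimate in GB: parameters x bytes per parameter x overhead."""
    return params_billion * bytes_per_param * overhead


def gpus_needed(required_gb: float, gpu_memory_gb: float) -> int:
    """Smallest power-of-two GPU count whose combined memory covers the estimate."""
    count = 1
    while count * gpu_memory_gb < required_gb:
        count *= 2
    return count


if __name__ == "__main__":
    models = [("Qwen2.5-7B FP16", 7, 2), ("Qwen2.5-72B FP16", 72, 2), ("Llama-3-70B INT4", 70, 0.5)]
    for name, params, width in models:
        need = estimate_vram_gb(params, width)
        print(f"{name}: ~{need:.1f} GB -> {gpus_needed(need, 80)} x 80GB GPU(s)")
```

Treat the result as a lower bound for capacity planning; the real requirement depends on context length and concurrency.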

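2.3 Network Requirements

Scenario	Bandwidth requirement	Recommended interconnect
Single-GPU inference	No special requirement	Standard Ethernet
Multi-GPU inference (TP=4)	> 100 GB/s between GPUs	NVLink
Single-node 8-GPU training	> 300 GB/s between GPUs	NVLink + NVSwitch
Multi-node multi-GPU training	> 200 Gbps between nodes	InfiniBand HDR/NDR or RoCE

3. Inference Gateway and Traffic Management

3.1 LiteLLM Proxy (Recommended)

LiteLLM serves as a unified inference gateway that hides the differences between model backends: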

# litellm-config.yaml
model_list:
  - model_name: qwen-7b
    litellm_params:
      model: openai/qwen-7b
      api_base: http://vllm-qwen.default.svc:8000/v1
      api_key: "sk-dummy"
  - model_name: llama-70b
    litellm_params:
      model: openai/llama-70b
      api_base: http://vllm-llama.default.svc:8000/v1
      api_key: "sk-dummy"                  # placeholder; the openai/ provider expects a key to be set

router_settings:
  routing_strategy: "usage-based-routing" # load balancing by usage
  allowed_fails: 3
  num_retries: 2
  cooldown_time: 30                      # cool a failing backend down for 30s

litellm_settings:
  drop_params: true                       # silently drop unsupported parameters
  set_verbose: false
  request_timeout: 600                    # inference timeout (seconds)
  max_parallel_requests: 1000

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

3.2 Deploying LiteLLM on K8s

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-v1.52
        args:
        - "--config"
        - "/app/config.yaml"
        - "--port"
        - "4000"
        ports:
        - containerPort: 4000
        env:
        - name: LITELLM_MASTER_KEY
          valueFrom:
            secretKeyRef:
              name: litellm-secret
              key: master-key
        volumeMounts:
        - name: config
          mountPath: /app/config.yaml
          subPath: config.yaml
        resources:
          limits:
            cpu: 2
            memory: 4Gi
          requests:
            cpu: 500m
            memory: 1Gi
      volumes:
      - name: config
        configMap:
          name: litellm-config
---
apiVersion: v1
kind: Service
metadata:
  name: litellm
spec:
  selector:
    app: litellm
  ports:
  - port: 4000

# Test the inference gateway
# curl http://litellm.default.svc:4000/v1/chat/completions \
#   -H "Authorization: Bearer sk-your-key" \
#   -H "Content-Type: application/json" \
#   -d '{"model":"qwen-7b","messages":[{"role":"user","content":"你好"}]}'
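The same smoke test can be scripted with only the Python standard library, which is handy inside a CI job or a debug Pod without curl. This is a sketch under the assumption that the gateway exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl command; `build_chat_request` and `send` are illustrative helper names, not part of LiteLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage: `send(build_chat_request("http://litellm.default.svc:4000", "sk-your-key", "qwen-7b", "Hello"))` returns the decoded response; the 600-second default mirrors the gateway's `request_timeout`.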

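3.3 Rate Limiting for Inference

# LiteLLM rate limiting
# Per backend: set rpm (requests per minute) under litellm_params
model_list:
  - model_name: qwen-7b
    litellm_params:
      model: openai/qwen-7b
      api_base: http://vllm-qwen.default.svc:8000/v1
      rpm: 200               # at most 200 requests per minute to this backend

# Per API key: set rpm_limit when issuing a virtual key via POST /key/generate,
# e.g. {"rpm_limit": 100} limits that key to 100 requests per minute

# Alternatively, use K8s Ingress + Envoy rate limiting
# to protect the GPU inference backends from traffic bursts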
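To make the idea of a per-key requests-per-minute limit concrete, here is a minimal token-bucket sketch. It illustrates the general technique only and is not how LiteLLM or Envoy implement limiting internally; the class names and the injectable `clock` parameter are assumptions made for this example.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow `rate_per_minute` requests per minute, with bursts of up to `burst`."""

    def __init__(self, rate_per_minute: float, burst: int, clock: Callable[[], float] = time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyLimiter:
    """Keep one bucket per API key, created on first use."""

    def __init__(self, rate_per_minute: float, burst: int, clock: Callable[[], float] = time.monotonic):
        self.rate_per_minute = rate_per_minute
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        bucket = self.buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_minute, self.burst, self.clock)
            self.buckets[api_key] = bucket
        return bucket.allow()
```

A request for which `allow()` returns `False` would typically be rejected with HTTP 429 before it reaches the GPU backend.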

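4. Vector Database Operations

4.1 Comparison

Database	Positioning	Index algorithms	GPU acceleration	K8s-friendly	Recommended for
Milvus	Distributed vector database	IVF/HNSW/DiskANN	✅	⚠️ Many components	Production RAG at tens of millions of vectors and above
Qdrant	Lightweight, high performance	HNSW	❌	✅ Single binary	Small to medium scale (millions of vectors)
Chroma	Developer-friendly	HNSW	❌	✅ Minimal	Prototyping
Weaviate	Multimodal search	HNSW/Flat	✅	⚠️	Hybrid search

4.2 Deploying Qdrant (Recommended for Getting Started)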

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
spec:
  serviceName: qdrant
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
      - name: qdrant
        image: qdrant/qdrant:v1.11
        ports:
        - containerPort: 6333
          name: http
        - containerPort: 6334
          name: grpc
        resources:
          limits:
            memory: 8Gi
            cpu: 4
          requests:
            memory: 2Gi
            cpu: 1
        volumeMounts:
        - name: storage
          mountPath: /qdrant/storage
        livenessProbe:
          httpGet:
            path: /healthz
            port: 6333
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

4.3 Vector Database Operations Checklist

# Health check
curl http://qdrant:6333/healthz

# List collections
curl http://qdrant:6333/collections

# Show cluster information
curl http://qdrant:6333/cluster

# Backup: Qdrant supports snapshots
curl -X POST http://qdrant:6333/collections/my_collection/snapshots

# Key metrics to monitor
# - Index build time
# - Query latency (P50/P99)
# - Disk usage
# - Vector count
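For a scheduled backup job, the collection list can be turned into one snapshot request per collection. The sketch below only handles the parsing and URL construction; it assumes the `GET /collections` response has the shape `{"result": {"collections": [{"name": ...}]}, "status": "ok"}`, and both helper names are illustrative.

```python
from typing import List


def collection_names(payload: dict) -> List[str]:
    """Extract collection names from a decoded GET /collections response."""
    if payload.get("status") != "ok":
        raise RuntimeError(f"unexpected Qdrant status: {payload.get('status')!r}")
    return [item["name"] for item in payload["result"]["collections"]]


def snapshot_urls(base_url: str, names: List[str]) -> List[str]:
    """Build the POST target for creating a snapshot of each collection."""
    base = base_url.rstrip("/")
    return [f"{base}/collections/{name}/snapshots" for name in names]
```

Each returned URL is the target of a `POST`, exactly as in the curl snapshot command above.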

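5. Observability for AI Services

5.1 Key Metrics for Inference Services

# Metric dimensions to collect
Dimension 1 - Model inference performance:
  - TTFT (Time To First Token):       latency until the first token
  - TPOT (Time Per Output Token):     generation time per output token
  - E2E latency:                      end-to-end latency
  - Throughput:                       tokens/s and requests/s

Dimension 2 - GPU resources:
  - GPU utilization, memory usage, temperature            (DCGM)
  - Compute utilization / memory bandwidth utilization    (DCGM)

Dimension 3 - Service quality:
  - Request success rate
  - Queue wait time
  - Number of OOM events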

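5.2 Prometheus Metrics Exposed by vLLM

# vLLM metrics endpoint
curl http://vllm-qwen:8000/metrics

# Core metrics (names can differ between vLLM versions; check the /metrics output):
vllm:num_requests_running           # requests currently being processed
vllm:num_requests_waiting           # requests waiting in the queue
vllm:gpu_cache_usage_perc          # KV cache utilization
vllm:time_to_first_token_seconds   # TTFT histogram
vllm:time_per_output_token_seconds # TPOT histogram
vllm:request_success_total         # number of successful requests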
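When Prometheus is not yet wired up, the `/metrics` output can be inspected directly. The following sketch parses the plain-text exposition format and sums each wanted metric across its label sets; it assumes sample lines carry no trailing timestamp, and `parse_gauges` is an illustrative helper rather than part of any client library.

```python
from typing import Dict, Set


def parse_gauges(metrics_text: str, wanted: Set[str]) -> Dict[str, float]:
    """Sum the values of the wanted metrics across all label sets."""
    values: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name in wanted:
            values[name] = values.get(name, 0.0) + float(value)
    return values
```

For example, `parse_gauges(text, {"vllm:num_requests_waiting"})` gives the current queue depth across all served models.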

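5.3 Example Alert Rules

groups:
- name: ai_inference_alerts
  rules:
  - alert: HighQueueDepth
    expr: vllm:num_requests_waiting > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Model {{ $labels.model_name }} has more than 100 queued requests; consider scaling out"

  - alert: HighTTFT
    # histogram_quantile needs the per-bucket rate, aggregated by le
    expr: histogram_quantile(0.95, sum by (le, model_name) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "P95 TTFT > 3s; users will notice the latency"

  - alert: OOMEvents
    # Assumes a failure counter with a reason label is exported (for example by a custom exporter);
    # confirm the metric name against your /metrics output before relying on this rule
    expr: increase(vllm:request_failure_total{reason="OOM"}[10m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Inference OOM; concurrency may be too high or GPU memory too small for the model"

  - alert: KVCacheNearFull
    expr: vllm:gpu_cache_usage_perc > 0.95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "KV cache is nearly full; requests may be blocked"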
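A TTFT alert is computed from histogram buckets rather than from individual samples, so it helps to see how a quantile is estimated from them. The sketch below follows the usual linear-interpolation approach on cumulative `(upper_bound, count)` buckets; it is a simplified illustration rather than Prometheus's exact implementation, and the bucket values in the usage note are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound is expected to be +Inf and to hold
    the total observation count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With buckets `[(0.5, 50), (1.0, 80), (2.0, 90), (5.0, 98), (inf, 100)]`, the 95th percentile lands in the 2s to 5s bucket, which would trip a 3-second threshold.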

6. Troubleshooting Common Failures

6.1 GPU Pod Fails to Start

# Symptom
kubectl describe pod gpu-pod | grep -A10 Events
# "0/3 nodes are available: 3 Insufficient nvidia.com/gpu"

# Troubleshooting steps
# 1. Confirm the nodes expose GPUs
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.capacity."nvidia.com/gpu"}'

# 2. Confirm the Device Plugin is healthy (it runs in the gpu-operator namespace when installed by the GPU Operator)
kubectl get pods -A | grep nvidia-device-plugin

# 3. Confirm the GPUs are not already allocated to other Pods
kubectl describe node <gpu-node> | grep -A10 "Allocated resources"

# 4. Confirm no taint is blocking scheduling
kubectl describe node <gpu-node> | grep Taints
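Steps 1 and 3 can be combined into a single free-GPU report by processing the JSON from `kubectl get nodes -o json` and `kubectl get pods -A -o json`. This is a minimal sketch: the function names are illustrative, and counting only container `limits` (ignoring init containers) is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_capacity(nodes: dict) -> Dict[str, int]:
    """Allocatable GPUs per node, from `kubectl get nodes -o json`."""
    return {
        node["metadata"]["name"]: int(node["status"].get("allocatable", {}).get(GPU_RESOURCE, 0))
        for node in nodes["items"]
    }


def gpu_requested(pods: dict) -> Dict[str, int]:
    """GPUs claimed per node by scheduled, non-terminated Pods."""
    used: Dict[str, int] = defaultdict(int)
    for pod in pods["items"]:
        node = pod["spec"].get("nodeName")
        phase = pod.get("status", {}).get("phase")
        if not node or phase in ("Succeeded", "Failed"):
            continue  # unscheduled or finished Pods hold no GPU
        for container in pod["spec"]["containers"]:
            limits = container.get("resources", {}).get("limits") or {}
            used[node] += int(limits.get(GPU_RESOURCE, 0))
    return dict(used)


def gpu_free(nodes: dict, pods: dict) -> Dict[str, int]:
    """Allocatable minus claimed GPUs for every node."""
    used = gpu_requested(pods)
    return {name: capacity - used.get(name, 0) for name, capacity in gpu_capacity(nodes).items()}
```

A node that reports zero free GPUs explains an `Insufficient nvidia.com/gpu` event even when the hardware itself is healthy.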

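6.2 CUDA OOM

# Symptom: RuntimeError: CUDA out of memory. Tried to allocate XXX MiB

# Root-cause checks
# 1. Look for orphaned GPU memory (nvidia-smi shows memory in use but no active process)
nvidia-smi

# 2. Look for zombie processes holding the devices
sudo fuser -v /dev/nvidia*

# 3. List GPU Pods on the node to spot oversubscription of a shared GPU
kubectl get pods -A --field-selector spec.nodeName=<node> -o json | \
  jq '[.items[] | select(any(.spec.containers[]; .resources.limits."nvidia.com/gpu" != null)) | {name: .metadata.name, ns: .metadata.namespace, gpu: ([.spec.containers[].resources.limits."nvidia.com/gpu" | select(. != null) | tonumber] | add)}]'

# Mitigations
# - Lower --gpu-memory-utilization (vLLM): 0.90 → 0.85
# - Lower --max-model-len (reserves less KV cache)
# - Lower --max-num-seqs (less concurrency)
# - Quantize the model: bf16 → int8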
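To see why lowering `--max-model-len` or `--max-num-seqs` relieves memory pressure, it helps to estimate the worst-case KV cache footprint. The sketch below uses the standard per-token formula (one key and one value tensor per layer); the model shape in the usage note is a hypothetical grouped-query-attention configuration, not the published shape of any specific model.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # The leading 2 accounts for storing both a key and a value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(max_model_len: int, max_num_seqs: int, num_layers: int,
                 num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> float:
    """Worst case in GiB: every concurrent sequence filled to the full context length."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value)
    return per_token * max_model_len * max_num_seqs / 2**30
```

With the assumed shape of 32 layers, 8 KV heads, and a head dimension of 128 in 16-bit precision, 64 sequences at an 8192-token context need 64 GiB in the worst case, and halving the context length halves that figure. vLLM allocates KV cache in pages, so real usage tracks live tokens and is normally lower.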

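6.3 NCCL Communication Timeout

# Symptoms: Watchdog caught collective operation timeout
#           NCCL WARN Net plugin not found

# Troubleshooting
# 1. Make sure the Pod uses hostNetwork or that NCCL is bound to the correct network interface
export NCCL_SOCKET_IFNAME=eth0    # network interface NCCL should use

# 2. Check firewall/NAT rules (port 22 only verifies basic reachability; also test the ports your job actually uses)
kubectl run nccl-test --rm -it --image=nvidia/cuda:12.4.0-runtime-ubuntu22.04 -- \
  bash -c "apt update && apt install -y netcat-openbsd && nc -zv <peer-ip> 22"

# 3. Enable verbose NCCL logging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH

# 4. Common NCCL messages
# NCCL WARN Net plugin not found → no external libnccl-net.so; NCCL falls back to its built-in transports, which is usually harmless
# NCCL WARN Failed to open libnccl-net.so → check LD_LIBRARY_PATH if you expect an external plugin to load
# ncclSystemError → a system call failed, often a socket or network problem between nodes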
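When the training image has Python but no `nc`, the reachability check from step 2 can be done with the standard library. This is a small sketch with illustrative helper names; a successful TCP connect only shows that the port is open and not filtered, not that NCCL itself is healthy.

```python
import socket
from typing import Dict, Iterable, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_peers(peers: Iterable[Tuple[str, int]], timeout: float = 3.0) -> Dict[Tuple[str, int], bool]:
    """Check several (host, port) pairs and report each result."""
    return {(host, port): tcp_reachable(host, port, timeout) for host, port in peers}
```

Run it from inside the worker Pod against each peer's address and the rendezvous port your launcher uses.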

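6.4 Sudden Latency Spikes

# Investigate layer by layer
# Layer 1: inference service
kubectl logs -l app=vllm-qwen --tail=50
# Look for messages such as "Preemption" or "swap"

# Layer 2: K8s
kubectl top pods -l app=vllm-qwen
# Look for CPU usage pinned at the limit (throttling) or memory pressure

# Layer 3: GPU
kubectl exec -it <gpu-pod> -- nvidia-smi
# Check GPU utilization, memory, temperature, and power draw

# Layer 4: network
# Latency between the inference gateway and vLLM
# Check the service mesh / CNI for anomalies

# Common root causes:
# - KV cache full → requests queue up → scale out or lower max-model-len
# - CPU throttling → high request-scheduling overhead → raise CPU limits
# - GPU clock throttling (temperature > 85°C) → check cooling
# - Model still loading (for example after a Pod was preempted and restarted) → add a readiness probe with a sufficient initial delay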

7. GPU Cluster Operations SOP

7.1 Adding and Removing GPU Nodes

# === Add a GPU node ===
# 1. Install the driver (if the GPU Operator does not deploy a driver DaemonSet)
# 2. Label the node
kubectl label node <new-gpu-node> \
  nvidia.com/gpu.present=true \
  accelerator=nvidia

# 3. Verify
kubectl get node <new-gpu-node> -o json | jq '.status.capacity."nvidia.com/gpu"'
kubectl -n gpu-operator logs -l app=nvidia-device-plugin-daemonset

# === Remove a GPU node (with GPU Pods still running) ===
# 1. Cordon + drain
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=600s

# 2. If Pods cannot be evicted (no other GPU node to schedule onto),
#    first list the GPU Pods on the node
kubectl get pods -A --field-selector spec.nodeName=<node> -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .resources.limits."nvidia.com/gpu" != null)) | "\(.metadata.namespace)/\(.metadata.name)"'

# 3. Notify the owners and migrate manually, or scale the workloads down and drain again

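7.2 GPU Fault Handling Flow

GPU alert fires
    │
    ├─ Temperature alert → check data-center cooling, clean the fans, reduce load
    │
    ├─ ECC error → isolate the node immediately
    │   └─ kubectl cordon <node>
    │   └─ Run nvidia-smi -q -d ECC for details
    │   └─ Request a hardware replacement
    │
    ├─ XID error → look up the XID code
    │   ├─ XID 48: Double Bit ECC Error → hardware fault; replace the GPU
    │   ├─ XID 79: GPU has fallen off the bus → PCIe problem
    │   └─ XID 13: Graphics Engine Exception → software/driver problem
    │
    └─ Pod OOM → not a GPU hardware problem
        └─ See 6.2 CUDA OOM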

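7.3 Common XID Error Codes

XID	Meaning	Action
13	Graphics Engine Exception	Reload the driver; recurring errors point to a hardware problem
31	GPU memory page fault	Application bug or faulty GPU memory
43	GPU stopped processing	Usually an application fault; restart the workload and investigate the GPU only if it recurs
45	Preemptive cleanup after an earlier error	Usually follows another XID (for example a double-bit ECC error); find and handle the preceding error
48	Double Bit ECC Error	Replace the GPU immediately
61	Internal micro-controller error	Try a reboot; replace the GPU if it recurs
79	GPU has fallen off the bus	PCIe connection problem; check the slot
119	GSP RPC timeout	Reset the GPU or reboot the node; if it recurs, suspect the driver or GSP firmware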
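XID events show up in the kernel log, so the lookup table can drive a simple triage script. The sketch below assumes the common `NVRM: Xid (PCI:...): <code>, ...` line format from `dmesg`; the action map covers only a few codes from the table, and the helper names are illustrative.

```python
import re
from typing import List, Tuple

XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")

# Subset of the table above; extend it to match your own SOP.
XID_ACTIONS = {
    13: "software/driver problem: reload the driver",
    31: "application bug or faulty GPU memory",
    48: "double-bit ECC error: cordon the node and replace the GPU",
    79: "GPU fell off the bus: check the PCIe slot",
}


def parse_xid_events(log_text: str) -> List[Tuple[str, int]]:
    """Return (pci_address, xid_code) for every XID line in a kernel log."""
    return [(pci, int(code)) for pci, code in XID_PATTERN.findall(log_text)]


def triage(log_text: str) -> List[str]:
    """Turn XID events into one human-readable action line per event."""
    lines = []
    for pci, code in parse_xid_events(log_text):
        action = XID_ACTIONS.get(code, "unknown code: look it up in the NVIDIA XID documentation")
        lines.append(f"{pci} XID {code}: {action}")
    return lines
```

Feeding it the output of `dmesg` on a GPU node gives one action line per event, which can be attached to the alert or ticket.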

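8. Frequently Asked Interview Topics for AI Operations Roles

8.1 Technical Questions Cheat Sheet

Question	Key points
"How do you troubleshoot a GPU Pod that fails to start?"	Node GPU capacity → Device Plugin status → Taint/Toleration → NodeAffinity
"How do you deploy large-model inference on K8s?"	vLLM Deployment + GPU Operator + HPA + model storage strategy
"How do you handle CUDA OOM?"	Lower gpu-memory-utilization, lower max-model-len, quantize, add GPUs
"How do you choose between GPTQ, AWQ, and FP8 quantization?"	GPTQ (balanced), AWQ (speed first), FP8 (H100); each needs hardware support
"How do you run multi-GPU inference?"	tensor-parallel-size=N; requires NVLink interconnect
"How do you reduce GPU cost?"	MIG partitioning, Spot instances, time-slicing, quantization, automatic reclamation
"How does operating AI inference differ from operating a traditional API?"	Stateful (KV cache), GPU memory management, slow cold starts, GPUs cannot be oversubscribed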

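8.2 A Framework for Open-Ended Questions

"What does your company's AI infrastructure architecture look like?"

Answer framework (it lets you show a structured, concrete understanding even without hands-on experience):

Hardware layer: A100/H100 GPU cluster + InfiniBand/RoCE network
Scheduling layer: K8s + NVIDIA GPU Operator (Device Plugin + MIG + DCGM)
Inference layer: one vLLM Deployment per model + LiteLLM as the unified gateway
Storage layer: model registry (S3) + PVC hot cache + vector database (Milvus/Qdrant)
Monitoring layer: DCGM GPU metrics + vLLM inference metrics → Prometheus + Grafana
Cost control: MIG for multi-tenant inference + Spot instances for training + automatic reclamation of idle GPUs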

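9. Hands-On Practice Roadmap

Week 1: Local GPU environment
├── Install the GPU driver + CUDA Toolkit
├── Run nvidia-smi to get familiar with GPU status
└── docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Week 2: K8s GPU environment
├── Build a GPU-enabled K8s cluster with kind/minikube (or use a cloud GPU instance)
├── Install the GPU Operator
└── Deploy a vLLM instance and run inference end to end

Week 3: Serving
├── Add a Service + HPA for vLLM
├── Load-test with Locust/wrk and watch the GPU metrics
└── Set up Prometheus + a Grafana GPU dashboard

Week 4: Advanced
├── Deploy the LiteLLM inference gateway
├── Try MIG partitioning
└── Deploy the Qdrant vector database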