OpenTelemetry Advanced Architecture

Core Content

1. OTel Operator: Kubernetes-Native Automated Management

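The OTel Operator is a Kubernetes Operator that manages the Collector lifecycle and injects auto-instrumentation. It turns Collector deployment from a manual task into declarative management.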

1.1 Installation

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-operator open-telemetry/opentelemetry-operator \
  --namespace observability \
  --create-namespace

1.2 CRD: Declarative Collector Deployment

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment          # Agent: daemonset | Gateway: deployment | also supports sidecar
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2
      memory: 2Gi
  # Environment variables set on the Collector container
  env:
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: "environment=production,cluster=us-east-1"
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1024
      batch:
        timeout: 1s
        send_batch_size: 1024
      k8sattributes:
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.ip
    exporters:
      otlp/tempo:
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/tempo]

1.3 Auto-Instrumentation Injection (Instrumentation CRD)

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
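# Annotate the target namespace to enable automatic injection
# kubectl annotate namespace production instrumentation.opentelemetry.io/inject-java=true

Once enabled, newly created Pods in the namespace get the Java agent injected with no code changes. Pods that already exist must be restarted to pick it up, and Python workloads need the separate inject-python annotation.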

2. Large-Scale Collector Deployment Architecture

2.1 Three-Tier Architecture (1000+ Services)

                        ┌─────────────┐
                        │   Tempo     │
                        │  (Traces)   │
                        └──────▲──────┘
                               │ OTLP
                    ┌──────────┴──────────┐
                    │    Gateway Cluster  │
                    │   (3-5 replicas)    │
                    │   global sampling   │
                    │   + aggregation     │
                    └──────────▲──────────┘
                               │ OTLP
               ┌───────────────┼───────────────┐
               │               │               │
    ┌──────────┴───┐  ┌───────┴──────┐  ┌─────┴─────────┐
    │ Agent Group A│  │ Agent Group B│  │ Agent Group C │
    │ (DaemonSet)  │  │ (DaemonSet)  │  │ (DaemonSet)   │
    │ recv+forward │  │ recv+forward │  │ recv+forward  │
    └──────▲───────┘  └──────▲───────┘  └──────▲────────┘
           │                 │                 │
    ┌──────┴──────┐  ┌──────┴──────┐  ┌──────┴────────┐
    │K8s cluster A│  │K8s cluster B│  │ K8s cluster C │
    │  (200+ Pod) │  │  (150+ Pod) │  │  (300+ Pod)   │
    └─────────────┘  └─────────────┘  └───────────────┘

2.2 Division of Labor: Agent Mode vs. Gateway Mode

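Tier	Role	Key configuration	Suggested resources
Agent (DaemonSet)	Receives data locally, injects K8s metadata, simple compression	`k8sattributes` processor, `batch`, `memory_limiter`	256Mi~512Mi memory / 200m CPU
Gateway (Deployment)	Global sampling, aggregation across Agents, load balancing, tenant isolation	`tail_sampling`, `routing`, `batch` (large batches)	1Gi~4Gi memory / 1-4 CPU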

2.3 Load Balancing at the Gateway Tier

# Agent-side config: load-balance across multiple Gateway instances
exporters:
  loadbalancing:
    routing_key: "traceID"        # hash by trace ID so every span of a trace goes to the same Gateway
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway.observability.svc.cluster.local   # should resolve to individual Pod IPs (e.g. a headless Service)
        port: 4317

Key point: routing_key: "traceID" ensures that tail_sampling sees complete traces, because all spans of a trace land on the same Gateway instance.
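
The effect of routing by trace ID can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the gateway names are hypothetical, and a plain hash-modulo-N stands in for whatever hashing scheme the loadbalancing exporter uses internally. The only point is that the choice depends on the trace ID alone, so spans emitted by different services for the same trace converge on one instance.

```python
import hashlib

# Hypothetical gateway endpoints; in the config above they would come from DNS resolution.
GATEWAYS = ["otel-gateway-0:4317", "otel-gateway-1:4317", "otel-gateway-2:4317"]


def pick_gateway(trace_id: str, gateways: list[str]) -> str:
    """Map a trace ID to one gateway by hashing it (simplified: hash modulo N)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


# Spans from three different services that belong to the same trace.
spans = [
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "service": "frontend"},
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "service": "checkout"},
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "service": "payment"},
]

# Every span resolves to the same gateway, so the set has exactly one element.
targets = {pick_gateway(span["trace_id"], GATEWAYS) for span in spans}
print(targets)
```

With modulo hashing, changing the number of gateways remaps most trace IDs, which is one reason a production implementation would prefer a consistent-hash ring.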

3. Multi-Tenant Architecture

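# Gateway-side config: route each tenant to its own backend
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        include_metadata: true    # needed so processors can read request metadata

processors:
  # Copy the tenant from the x-tenant-id HTTP header / gRPC metadata onto the resource
  resource/tenant:
    attributes:
      - key: tenant_id
        from_context: metadata.x-tenant-id   # read from request metadata
        action: upsert

connectors:
  # Listing both exporters in a single pipeline would send every trace to both
  # backends; the routing connector picks one pipeline per tenant instead.
  # Data that matches no rule is dropped unless default_pipelines is set.
  routing:
    table:
      - statement: route() where attributes["tenant_id"] == "tenant-a"
        pipelines: [traces/tenant-a]
      - statement: route() where attributes["tenant_id"] == "tenant-b"
        pipelines: [traces/tenant-b]

exporters:
  otlp/tenant-a:
    endpoint: tempo-tenant-a:4317
  otlp/tenant-b:
    endpoint: tempo-tenant-b:4317

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      processors: [resource/tenant]
      exporters: [routing]
    traces/tenant-a:
      receivers: [routing]
      exporters: [otlp/tenant-a]
    traces/tenant-b:
      receivers: [routing]
      exporters: [otlp/tenant-b]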

4. Collector Performance Tuning

4.1 Metrics for Diagnosing Bottlenecks

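# Collector self-metrics (port 8888)

# Queue utilization: a full queue means data is at risk of being dropped
otelcol_exporter_queue_size{exporter="otlp/tempo"} / otelcol_exporter_queue_capacity

# Spans refused by the receiver: upstream SDKs start retrying
rate(otelcol_receiver_refused_spans[5m])

# Bytes per second leaving the batch processor (throughput, not latency): use the trend to judge whether to scale out
rate(otelcol_processor_batch_batch_send_size_bytes_sum[5m])

# Memory usage: early warning for OOM
otelcol_process_memory_rss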

4.2 Common Bottlenecks and Fixes

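Bottleneck	Symptom	Fix
Too many gRPC connections	SDKs reconnect frequently	Add Gateway replicas + DNS load balancing
Batches sent too often	High exporter network overhead	Raise `send_batch_size` to 4096 or more
tail_sampling runs out of memory	OOM kill	Lower `num_traces` or shorten `decision_wait`
Disk I/O becomes the bottleneck	Seen when using the file exporter	Switch to an in-memory queue + OTLP exporter
Slow remote end for the exporter	Queue backs up	Increase `sending_queue.queue_size` and `num_consumers`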

4.3 Exporter Backpressure Configuration

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    sending_queue:
      enabled: true
      num_consumers: 10        # number of concurrent senders
      queue_size: 5000         # queue capacity (at most 5000 batches buffered in memory)
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s   # give up retrying after 5 minutes
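
To get a feel for what these retry settings mean in practice, the following Python sketch lists the waits between attempts for initial_interval 1s, max_interval 30s and max_elapsed_time 300s. It is a rough model, not the Collector's implementation: the growth factor of 1.5 is an assumption made for illustration, and the random jitter a real backoff adds is ignored.

```python
def backoff_schedule(initial: float, max_interval: float,
                     max_elapsed: float, multiplier: float = 1.5) -> list[float]:
    """Return the waits (in seconds) between retries until max_elapsed would be exceeded.

    multiplier=1.5 is an assumed growth factor; jitter is deliberately left out.
    """
    waits: list[float] = []
    elapsed = 0.0
    interval = initial
    while elapsed + interval <= max_elapsed:
        waits.append(interval)
        elapsed += interval
        # Grow the wait exponentially, but never beyond max_interval.
        interval = min(interval * multiplier, max_interval)
    return waits


waits = backoff_schedule(initial=1.0, max_interval=30.0, max_elapsed=300.0)
print(len(waits), round(sum(waits), 1))  # 16 284.9
```

Under these assumptions a batch is retried about 16 times before it is dropped, so an outage longer than five minutes still loses data unless the queue or a persistent store absorbs it.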

5. Combining OTel with eBPF

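When the application code cannot be modified, or the language has no auto-instrumentation support, eBPF provides automatic tracing at the operating-system level.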

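# Use Pixie (eBPF-based) to produce OTel data
# Pixie needs no code changes and is deployed as a K8s DaemonSet
# It automatically generates:
#   - spans for HTTP/gRPC requests (including body size and status code)
#   - spans for DNS queries
#   - protocol-level network metrics

# Install Pixie
px deploy

# Inspect the captured HTTP data
px run px/http_data
# Export to a Collector is configured through Pixie's OpenTelemetry export
# (PxL scripts using px.otel), not through a `px run` output format.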

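The OTel project is also developing an eBPF agent for Go: it leaves the binary unmodified and uses eBPF uprobes to intercept function calls and generate spans. It is still experimental, but it is a big step forward for Go services.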

6. Security and Compliance

6.1 mTLS: Agent → Gateway → Backend

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /certs/server.crt
          key_file: /certs/server.key
          client_ca_file: /certs/ca.crt   # require client certificates

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      cert_file: /certs/client.crt
      key_file: /certs/client.key
      ca_file: /certs/ca.crt

6.2 Redacting Sensitive Data

processors:
  redaction:
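    # Values matching a blocked_values regex are masked with "****"
    allow_all_keys: false      # keep only the keys listed below (allow-list mode)
    allowed_keys:
      - http.method
      - http.status_code
      - http.route
    blocked_values:
      - '\b\d{16}\b'            # 16-digit credit card number
      - '\b\d{17}[\dX]\b'       # 18-character resident ID number (17 digits + digit or X)
      - '(?i)password|secret|token|api_key'   # values that contain these keywords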

  attributes/remove_sensitive:
    actions:
      - key: user.email
        action: delete
      - key: user.phone
        action: delete
      - key: http.request.header.authorization
        action: delete
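
The combined effect of allowed_keys and blocked_values can be sketched in Python. This is a rough model of the behaviour described above, not the processor's code: the function name and the sample attributes are hypothetical, and only two of the patterns are reproduced.

```python
import re

ALLOWED_KEYS = {"http.method", "http.status_code", "http.route"}
BLOCKED_VALUES = [
    re.compile(r"\b\d{16}\b"),                          # 16-digit card number
    re.compile(r"(?i)password|secret|token|api_key"),   # sensitive keywords
]


def redact(attributes: dict[str, str]) -> dict[str, str]:
    """Drop keys that are not allow-listed, then mask blocked patterns in the values."""
    result: dict[str, str] = {}
    for key, value in attributes.items():
        if key not in ALLOWED_KEYS:
            continue  # allow_all_keys: false removes every unlisted key
        for pattern in BLOCKED_VALUES:
            value = pattern.sub("****", value)
        result[key] = value
    return result


span_attributes = {
    "http.method": "POST",
    "http.route": "/pay/4111111111111111",
    "user.email": "a@example.com",
}
print(redact(span_attributes))
# {'http.method': 'POST', 'http.route': '/pay/****'}
```

The allow-list removes user.email without a dedicated delete action, while the value patterns catch sensitive data that leaks into attributes that are otherwise safe to keep.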

6.3 Data Retention Policy

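Retention targets on the Tempo side:
- Normal traces: keep 7 days
- Error traces: keep 30 days
- Sampled traces: adjust to the ingest rate (for example, 10% sampling → keep 14 days)

Tempo applies retention per instance or per tenant rather than per trace, so on the Collector side use the routing connector to export error traces separately to a backend (or tenant) with the longer retention.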

7. Migrating from Jaeger/Zipkin to OTel

7.1 Jaeger → OTel Migration Path

Phase 1: dual write (2 weeks)
  App → [Jaeger Agent] → Jaeger Collector → Jaeger Backend
  App → [OTel SDK + Jaeger Exporter] → OTel Collector (emits Jaeger thrift)

Phase 2: switch the backend (1 week)
  App → [OTel SDK] → OTel Collector → Tempo
  The OTel Collector keeps accepting the Jaeger thrift protocol in the meantime

Phase 3: cleanup (1 week)
  Remove the Jaeger Agent/Collector
  Optional: keep the Jaeger UI and point it at the Tempo backend

7.2 Receiving Jaeger Data in the OTel Collector

receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250   # Jaeger gRPC port
      thrift_http:
        endpoint: 0.0.0.0:14268   # Jaeger HTTP port

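This means you can start without changing any application code: point the Jaeger Agent's reporter at the OTel Collector instead of the Jaeger Collector, then migrate step by step.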

8. Collector High-Availability Design

8.1 Agent Tier (DaemonSet)

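The Agent tier is highly available by design: there is one Agent per node, and Kubernetes restarts the Pod automatically if it dies. Two points to keep in mind:

- No single point of failure: Pods send to the Agent on their own node, so a failed Agent affects only that node
- Backpressure handling: when the Gateway is unavailable, the Agent's sending_queue acts as a buffer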
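
How long that buffer lasts, and what it costs in memory, can be estimated with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and the queue size, batch size, span size and span rate are assumed values, not measurements.

```python
def buffer_seconds(queue_size: int, spans_per_batch: int, spans_per_second: float) -> float:
    """Seconds of traffic the queue can absorb before new batches are dropped."""
    return queue_size * spans_per_batch / spans_per_second


def queue_memory_mib(queue_size: int, spans_per_batch: int, span_bytes: int) -> float:
    """Approximate memory (MiB) held by a completely full queue."""
    return queue_size * spans_per_batch * span_bytes / (1024 * 1024)


# Assumed agent: 500 queued batches of 200 spans, 2 KiB per span, node emitting 1000 spans/s.
print(buffer_seconds(500, 200, 1000.0))   # 100.0
print(queue_memory_mib(500, 200, 2048))   # 195.3125
```

With these numbers the queue covers a Gateway outage of about 100 seconds but would occupy close to 200 MiB when full, a large share of a 256Mi~512Mi Agent. Queue size therefore has to be chosen together with the memory limit.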

8.2 Gateway Tier (Deployment)

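# Key Gateway high-availability settings (OpenTelemetryCollector spec;
# in a plain Deployment the affinity block goes under spec.template.spec)
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: otel-gateway        # assumes the Gateway Pods carry this label
          topologyKey: kubernetes.io/hostname  # at most one Gateway per node
---
# PodDisruptionBudget: keep at least 2 instances running
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: otel-gateway-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: otel-gateway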

8.3 Handling Failure Scenarios

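Scenario	Impact	Mitigation
Agent down	Data from that node is lost	DaemonSet recovers automatically; roughly 30s of data is lost
Gateway down	Agent queues back up	Agent `sending_queue` buffering + multiple replicas
Tempo down	Data loss	Collector `retry_on_failure` keeps retrying for 5 min
Network partition	Agents cannot reach the Gateway	Agent buffers in memory → resends once the network recovers
OOM kill	All in-flight data is lost	`memory_limiter` with `spike_limit_mib` to prevent cascading failure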

9. Cost Model and Data Volume Estimation

# Formula for estimating trace data volume
data_per_second = requests_per_second × avg_spans_per_request × avg_span_size

# Typical values:
# requests_per_second = 1000 (RPS)
# avg_spans_per_request = 10 (one request passes through 10 services)
# avg_span_size = 2KB (including attributes)
# → data_per_second = 1000 × 10 × 2KB = 20 MB/s (uncompressed)
# → compressed (gzip 5x) ≈ 4 MB/s → ~345 GB/day (unsampled)

# With 10% sampling: ~34.5 GB/day → ~1 TB/month
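
The same estimate can be written as a small Python function so that other traffic profiles are easy to plug in. This is a sketch under the assumptions of the worked example: decimal units (1 KB = 1000 bytes), a flat 5x compression ratio, and a 30-day month. The function and parameter names are hypothetical.

```python
def trace_volume(rps: float, spans_per_request: float, span_bytes: float,
                 compression_ratio: float = 5.0, sample_rate: float = 1.0) -> dict[str, float]:
    """Estimate raw and stored trace volume from request rate and span size."""
    raw_mb_s = rps * spans_per_request * span_bytes / 1e6
    stored_mb_s = raw_mb_s / compression_ratio * sample_rate
    gb_per_day = stored_mb_s * 86_400 / 1e3
    return {
        "raw_mb_s": raw_mb_s,
        "stored_mb_s": stored_mb_s,
        "gb_per_day": gb_per_day,
        "tb_per_month": gb_per_day * 30 / 1e3,
    }


full = trace_volume(rps=1000, spans_per_request=10, span_bytes=2000)
sampled = trace_volume(rps=1000, spans_per_request=10, span_bytes=2000, sample_rate=0.1)

print(full["raw_mb_s"], round(full["gb_per_day"], 1))                        # 20.0 345.6
print(round(sampled["gb_per_day"], 2), round(sampled["tb_per_month"], 2))    # 34.56 1.04
```

Because every factor is multiplicative, halving span size or sample rate halves storage, which is why the levers in the next table compound.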

Four levers for cost optimization:

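Lever	Savings	Trade-off
Head-based probabilistic sampling	Cuts volume by 90%+ directly	Most normal traces are lost
Tail-based error retention	Few normal traces kept, but 100% of errors	Collector memory overhead
Keep only necessary attributes	20-40%	Less information for troubleshooting
Shorter Tempo retention	Storage cost drops proportionally	Older data can no longer be queried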

10. Production Failure Modes

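Failure	Root cause	Detection	Fix
SDK exporter cannot reach the Collector	Network problem or wrong address	SDK log `failed to export`	Check the endpoint and firewall
Collector OOMs frequently	Memory configured too low	`otelcol_process_memory_rss > limit`	Raise the container memory limit and keep the `memory_limiter` limit below it
SDK spans dropped in bulk	Exporter queue full	`otelcol_exporter_queue_size ≈ capacity`	Scale out the Gateway or enlarge the queue
Incomplete traces (missing spans)	tail_sampling time window too short	Gaps in the flame graph	Increase `decision_wait`
Tempo query timeouts	Too much data or a fragmented index	Grafana returns 504	Compact the index, add ingester replicas
span_id collision (extremely rare, but it happens)	Random ID collision	A cycle appears in the flame graph	OTel SDK 1.0+ already uses sufficiently long random IDs; the probability is negligible
OTLP protocol version mismatch	SDK and Collector versions too far apart	`Unimplemented` gRPC error	Upgrade the Collector or downgrade the SDK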
