25 Common Kubernetes Pitfalls and How to Fix Them

nixiaole, 2025-09-15 00:12:04

Kubernetes is powerful, but it is easy to misuse. This article covers 25 common pitfalls, each with a fix and a YAML example that you can adapt quickly.

I. Basic configuration pitfalls

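1. No resource requests or limits

Problem: Without a resources section, a single Pod can consume all of a node's resources and starve other services.
Fix: Define reasonable CPU and memory requests and limits for each container.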

resources:
  requests:
    cpu: "250m"
    memory: "64Mi"
  limits:
    cpu: "500m"
    memory: "128Mi"
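
Individual manifests can still forget these settings. One possible safety net is a namespace-level LimitRange that supplies defaults for containers that declare nothing. This is a minimal sketch: the name default-limits and the production namespace are assumptions, and the values simply mirror the example above.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits    # hypothetical name
  namespace: production   # assumed namespace
spec:
  limits:
  - type: Container
    defaultRequest:       # used when a container sets no requests
      cpu: "250m"
      memory: "64Mi"
    default:              # used when a container sets no limits
      cpu: "500m"
      memory: "128Mi"
```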

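2. Using the latest image tag

Problem: latest is a moving tag, so you cannot tell which version is actually running, and rollbacks become difficult.
Fix: Pin a specific image version.

containers:
- name: my-app
  image: my-app:v1.2.3 # avoid latest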

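3. No liveness or readiness probes

Problem: Without probes, Kubernetes cannot tell when the application inside a running container has hung or is not yet ready to serve traffic.
Fix: Add a livenessProbe and a readinessProbe.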

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
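
For applications that start slowly, a fixed initialDelaySeconds can be too short, and the liveness probe may restart the container before it finishes booting. A startupProbe is one way to handle this: the other probes are held back until it succeeds. The sketch below reuses the /healthz endpoint from the example above; the thresholds are illustrative assumptions.

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s for startup
```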

4. Plaintext passwords in configuration files

Problem: High risk; credentials leak easily.
Fix: Manage them with a Secret.

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  password: cGFzc3dvcmQ= # base64 of the string password; base64 is an encoding, not encryption

env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-secret
      key: password
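
Environment variables are not the only way to consume a Secret. As an alternative sketch, the same db-secret can be mounted as read-only files, so the password appears at /etc/secrets/password instead of in the process environment. The volume name db-credentials and the mount path are assumptions chosen for illustration.

```yaml
containers:
- name: my-app
  image: my-app:v1.2.3
  volumeMounts:
  - name: db-credentials     # hypothetical volume name
    mountPath: /etc/secrets  # assumed path; one file per Secret key
    readOnly: true
volumes:
- name: db-credentials
  secret:
    secretName: db-secret
```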

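5. All Pods land on the same node

Problem: This is a single point of failure: if that node goes down, every replica goes down with it.
Fix: Configure Pod anti-affinity.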

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values: ["my-app"]
      topologyKey: "kubernetes.io/hostname"
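
A required anti-affinity rule leaves extra replicas Pending once there are more replicas than nodes. Where that is too strict, a softer alternative is a topology spread constraint that prefers an even spread but still schedules the Pod. This is a sketch under the same app: my-app label assumption as above.

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but never block scheduling
  labelSelector:
    matchLabels:
      app: my-app
```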

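6. All Pods killed during upgrades

Problem: If every replica is taken down at once, for example while nodes are drained during a cluster upgrade, the service becomes unavailable.
Fix: Configure a PodDisruptionBudget (PDB) to guarantee a minimum number of available replicas. A PDB only limits voluntary evictions such as node drains; the pace of a Deployment's own rollout is governed by its rolling update strategy (maxUnavailable and maxSurge).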

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
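
For the rollout itself, the Deployment's update strategy is the relevant knob. The following is a minimal sketch, assuming a three-replica my-app Deployment whose labels match the PDB selector above; the replica count and the surge values are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one replica down during a rollout
      maxSurge: 1         # at most one extra replica created temporarily
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v1.2.3
```

With three replicas and minAvailable: 2, a node drain can evict only one of these Pods at a time.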

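7. Different environments share one namespace

Problem: Poor resource isolation makes accidental changes likely.
Fix: Use a separate namespace for each environment.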

apiVersion: v1
kind: Namespace
metadata:
  name: production

8. No centralized log collection

Problem: Troubleshooting is difficult and operations are painful.
Fix: Deploy a log collection agent as a DaemonSet.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.14

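9. ServiceAccount permissions are too broad

Problem: A ServiceAccount with more permissions than the workload needs is a security risk.
Fix: Grant least-privilege RBAC permissions.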

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rb
subjects:
- kind: ServiceAccount
  name: my-app-sa
roleRef:
  kind: Role
  name: my-app-role
  apiGroup: rbac.authorization.k8s.io

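10. Exposing services directly with NodePort

Problem: Each NodePort opens a port on every node, which widens the attack surface and becomes hard to maintain as services multiply.
Fix: Use an Ingress as a single entry point.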

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80

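11. Hand-edited YAML with no version control

Problem: Environments drift, and changes cannot be traced or reproduced.
Fix: Manage configuration with GitOps.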

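apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd # assumes Argo CD runs in its default namespace
spec:
  project: default # required field; default is Argo CD's built-in project
  source:
    repoURL: https://github.com/org/repo
    targetRevision: main
    path: manifests/my-app
  destination:
    namespace: production
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true    # delete resources that were removed from Git
      selfHeal: true # revert manual changes made in the cluster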

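12. Containers run as root

Problem: Very poor security: a compromised process holds root privileges inside the container.
Fix: Run as a non-root user.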

securityContext:
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
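
The block above is a Pod-level securityContext (fsGroup is only valid at that level). It can be complemented by container-level settings that harden each container further. This is a sketch of commonly used fields, not a complete policy; the container name and image are the placeholders used elsewhere in this article, and readOnlyRootFilesystem assumes the application does not write to its root filesystem.

```yaml
containers:
- name: my-app
  image: my-app:v1.2.3
  securityContext:
    runAsNonRoot: true               # refuse to start if the image would run as UID 0
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true     # assumption: the app writes only to mounted volumes
    capabilities:
      drop: ["ALL"]
```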

13. All YAML is copy-pasted, with no templating

Problem: Hard to manage and hard to upgrade consistently.
Fix: Use Helm or Kustomize.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- deployment.yaml
- service.yaml
- ingress.yaml
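
The payoff comes from overlays that reuse this base per environment. As a rough sketch, assuming the file above lives in a base/ directory and the overlay in overlays/production/ (a hypothetical layout), a production overlay could look like this:

```yaml
# overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production   # applied to every resource in the base
resources:
- ../../base
images:
- name: my-app
  newTag: v1.2.3        # pin the image version for this environment
```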

14. Ephemeral storage causes data loss

Problem: Data disappears when the Pod is recreated.
Fix: Use a PersistentVolumeClaim (PVC).

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
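
The claim only helps once a Pod mounts it. A minimal sketch of the Pod spec fragment that uses my-pvc follows; the volume name data and the mount path /data are assumptions for illustration.

```yaml
containers:
- name: my-app
  image: my-app:v1.2.3
  volumeMounts:
  - name: data
    mountPath: /data    # hypothetical path where the app stores its files
volumes:
- name: data
  persistentVolumeClaim:
    claimName: my-pvc   # the PVC defined above
```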

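15. Misconfigured Job restart policy

Problem: A failing Job keeps restarting its Pods.
Fix: Set restartPolicy: OnFailure (a Job's Pod template accepts only OnFailure or Never) and cap the number of retries with backoffLimit.

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  backoffLimit: 3 # stop retrying after 3 failures (the default is 6)
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: job
        image: busybox
        command: ["echo", "hello"]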

II. Advanced pitfalls

16. Poor node scheduling

Problem: Compute and storage workloads end up on unsuitable nodes.
Fix: Configure node affinity.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values: ["gpu"]

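17. Running stateful services with a Deployment

Problem: Pods managed by a Deployment have no stable identity or per-replica storage, so data is lost when a Pod is recreated.
Fix: Use a StatefulSet.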

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
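      containers:
      - name: mysql
        image: mysql:8
        env:
        - name: MYSQL_ROOT_PASSWORD # the official image will not start without a root password setting
          valueFrom:
            secretKeyRef:
              name: db-secret # reuses the Secret from item 4
              key: password
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql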
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Gi

18. Core services crowded out by low-priority workloads

Fix: Define a PriorityClass and assign it to critical workloads.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "For critical workloads"
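
A PriorityClass has no effect until a Pod references it. A minimal sketch of the Pod spec fragment follows; the container name and image are the placeholders used elsewhere in this article.

```yaml
spec:
  priorityClassName: high-priority   # the PriorityClass defined above
  containers:
  - name: my-app
    image: my-app:v1.2.3
```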

19. Application fails because a dependency has not started

Fix: Use an initContainer that waits for the dependency.

initContainers:
- name: init-db
  image: busybox
  command: ['sh', '-c', 'until nc -z db 3306; do sleep 2; done;']

20. etcd is not backed up

Fix: Back up etcd on a schedule.

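apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "0 */6 * * *" # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: bitnami/etcd
            # Add --endpoints and the TLS flags (--cacert, --cert, --key) that match your cluster.
            command: ["/bin/sh", "-c", "etcdctl snapshot save /backup/etcd-$(date +%F).db"]
            volumeMounts:
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: etcd-backup-pvc # hypothetical PVC; without a volume the snapshot is lost with the Pod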

21. Forcing a monolith into K8s

Fix: Split it into microservices.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user
  template:
    metadata:
      labels:
        app: user
    spec:
      containers:
      - name: user
        image: user-service:v1.0

22. Ignoring Pod security policy

Fix: Enforce a Pod Security Standard through Pod Security Admission by labeling the namespace.

apiVersion: v1
kind: Namespace
metadata:
  name: secure-ns
  labels:
    pod-security.kubernetes.io/enforce: restricted

23. Not using autoscaling

Fix: Configure a HorizontalPodAutoscaler (HPA).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
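
CPU utilization is measured against each container's CPU request, so the target Deployment needs the requests from item 1 for this metric to work. To keep the replica count from flapping, the HPA spec also accepts an optional behavior section. The fragment below goes under spec; the window and rate are illustrative assumptions, not recommendations.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes of low load before scaling down
    policies:
    - type: Pods
      value: 1                        # remove at most one Pod...
      periodSeconds: 60               # ...per minute
```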

24. All Pods can talk to each other by default

Fix: Define NetworkPolicies, starting from a default-deny policy.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
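
Because deny-all also blocks egress, Pods in the namespace can no longer resolve DNS names until traffic is explicitly allowed. One way to restore DNS is sketched below. It assumes the cluster DNS service runs in kube-system and that namespaces carry the automatic kubernetes.io/metadata.name label; the policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress   # hypothetical name
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system   # assumed location of cluster DNS
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

NetworkPolicies are additive, so this policy and deny-all can coexist: traffic is allowed if any policy permits it.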

25. All environments share one cluster with no isolation

Fix: Isolate tenants in separate namespaces.

apiVersion: v1
kind: Namespace
metadata:
  name: prod
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: dev
