kube-scheduler

    The scheduler needs to take many factors into account:

    • fair scheduling
    • efficient resource utilization
    • QoS
    • affinity and anti-affinity
    • data locality
    • inter-workload interference
    • deadlines

    There are three ways to restrict a Pod to run only on specific Nodes:

    • nodeSelector: schedule only onto Nodes whose labels match the given selector
    • nodeAffinity: a more expressive Node selector that, for example, supports set operators
    • podAffinity: schedule onto Nodes that already run Pods matching the given conditions

    First, label the Node.
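    For example (the node name node-01 is illustrative, not from the original):

        kubectl label nodes node-01 disktype=ssd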

    Then set nodeSelector to disktype=ssd in the DaemonSet:

        spec:
          nodeSelector:
            disktype: ssd

    nodeAffinity currently supports two forms, requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, i.e. hard requirements and soft preferences respectively. For example, the manifest below schedules the Pod onto a Node carrying the label kubernetes.io/e2e-az-name with value e2e-az1 or e2e-az2, and prefers Nodes that additionally carry the label another-node-label-key=another-node-label-value.

        apiVersion: v1
        kind: Pod
        metadata:
          name: with-node-affinity
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/e2e-az-name
                    operator: In
                    values:
                    - e2e-az1
                    - e2e-az2
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 1
                preference:
                  matchExpressions:
                  - key: another-node-label-key
                    operator: In
                    values:
                    - another-node-label-value
          containers:
          - name: with-node-affinity
            image: gcr.io/google_containers/pause:2.0

    podAffinity selects Nodes based on the labels of Pods already running on them, scheduling the Pod only onto Nodes hosting Pods that match the given conditions; both podAffinity and podAntiAffinity are supported. This feature can be confusing, so consider the following example:

    • if the zone a Node belongs to contains at least one running Pod with the label security=S1, the Pod can be scheduled onto that Node
    • the Pod is preferably not scheduled onto a Node hosting at least one running Pod with the label security=S2 (a soft rule, since the example uses preferredDuringSchedulingIgnoredDuringExecution)

        apiVersion: v1
        kind: Pod
        metadata:
          name: with-pod-affinity
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: security
                    operator: In
                    values:
                    - S1
                topologyKey: failure-domain.beta.kubernetes.io/zone
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: security
                      operator: In
                      values:
                      - S2
                  topologyKey: kubernetes.io/hostname
          containers:
          - name: with-pod-affinity
            image: gcr.io/google_containers/pause:2.0

    Taints and tolerations

    Taints and tolerations work together to keep Pods away from unsuitable Nodes: a taint is applied to a Node, while a toleration is applied to a Pod.

    Currently supported taint effects:

    • NoSchedule: new Pods are not scheduled onto the Node; Pods already running are unaffected
    • PreferNoSchedule: a soft version of NoSchedule; the scheduler tries to avoid the Node but may still use it
    • NoExecute: new Pods are not scheduled onto the Node, and Pods already running on it are evicted; a Pod may set a grace period (tolerationSeconds) to delay the eviction

    However, when a Pod's tolerations match all of a Node's taints, it can be scheduled onto that Node; if it is already running there, it will not be evicted. Additionally, for NoExecute, if the Pod sets tolerationSeconds, it is only evicted after that period elapses.

    For example, suppose several taints have been applied to node1.
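    The original taint commands are not shown here; judging from the tolerations discussed below, they were presumably along these lines:

        kubectl taint nodes node1 key1=value1:NoSchedule
        kubectl taint nodes node1 key1=value1:NoExecute
        kubectl taint nodes node1 key2=value2:NoSchedule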

    The following Pod cannot be scheduled onto node1, because it does not tolerate key2=value2:NoSchedule:

        tolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoExecute"

    A Pod that is already running and carries tolerationSeconds, by contrast, is evicted after 600s:

        tolerations:
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoSchedule"
        - key: "key1"
          operator: "Equal"
          value: "value1"
          effect: "NoExecute"
          tolerationSeconds: 600
        - key: "key2"
          operator: "Equal"
          value: "value2"
          effect: "NoSchedule"

    Note that Pods created by a DaemonSet automatically get NoExecute tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady, so that they are not evicted when those conditions occur.
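    Those auto-added tolerations are roughly of the following shape (a sketch; the exact keys have varied across Kubernetes versions):

        tolerations:
        - key: node.alpha.kubernetes.io/notReady
          operator: Exists
          effect: NoExecute
        - key: node.alpha.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute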

    Priority scheduling

    kube-scheduler also supports Pod priority, so that high-priority Pods are scheduled first. While the feature is in alpha it must be enabled explicitly:

    • apiserver: add --feature-gates=PodPriority=true --runtime-config=scheduling.k8s.io/v1alpha1=true
    • kube-scheduler: add --feature-gates=PodPriority=true

    Before assigning a Pod a priority, you must first define a PriorityClass (a non-namespaced resource), e.g.

        apiVersion: scheduling.k8s.io/v1alpha1
        kind: PriorityClass
        metadata:
          name: high-priority
        value: 1000000
        globalDefault: false
        description: "This priority class should be used for XYZ service pods only."

    where

    • value is the priority as a 32-bit integer; the larger the value, the higher the priority
    • globalDefault applies to Pods that do not set priorityClassName; at most one PriorityClass in the whole cluster should set it to true

    Then set the Pod's priority via priorityClassName in its PodSpec, as sketched below:
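    A minimal sketch (the Pod name and image are illustrative; priorityClassName references the PriorityClass defined above):

        apiVersion: v1
        kind: Pod
        metadata:
          name: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
          # reference the PriorityClass by name
          priorityClassName: high-priority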

    Multiple schedulers

    If the default scheduler does not meet your needs, you can deploy a custom scheduler. Moreover, several scheduler instances can run side by side in the same cluster; podSpec.schedulerName selects which one schedules a given Pod (the built-in scheduler is used by default).

        apiVersion: v1
        kind: Pod
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          # use the custom scheduler my-scheduler
          schedulerName: my-scheduler
          containers:
          - name: nginx
            image: nginx:1.10

    For a full example of a custom scheduler, see here; a toy version is sketched below.
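    As an illustration only, a well-known toy approach is a shell loop that binds pending Pods to a random node through the API server's binding subresource (this sketch assumes kubectl and jq are installed and that kubectl proxy is serving on localhost:8001):

        #!/bin/bash
        # Toy scheduler: binds every pending Pod that requested
        # schedulerName=my-scheduler to a randomly chosen node.
        SERVER='localhost:8001'
        while true; do
          for PODNAME in $(kubectl get pods -o json \
              | jq -r '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name'); do
            NODES=($(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'))
            NODE=${NODES[$RANDOM % ${#NODES[@]}]}
            # POST a Binding object to the Pod's binding subresource
            curl -s "http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/" \
              -H "Content-Type: application/json" -X POST \
              -d '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"'"$PODNAME"'"},"target":{"apiVersion":"v1","kind":"Node","name":"'"$NODE"'"}}'
          done
          sleep 1
        done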

    kube-scheduler also supports a --policy-config-file flag pointing to a scheduling policy file that customizes the scheduling policy, e.g.

        {
          "kind": "Policy",
          "apiVersion": "v1",
          "predicates": [
            {"name": "PodFitsHostPorts"},
            {"name": "PodFitsResources"},
            {"name": "NoDiskConflict"},
            {"name": "MatchNodeSelector"},
            {"name": "HostName"}
          ],
          "priorities": [
            {"name": "LeastRequestedPriority", "weight": 1},
            {"name": "BalancedResourceAllocation", "weight": 1},
            {"name": "ServiceSpreadingPriority", "weight": 1},
            {"name": "EqualPriority", "weight": 1}
          ],
          "extenders": [
            {
              "urlPrefix": "http://127.0.0.1:12346/scheduler",
              "apiVersion": "v1beta1",
              "filterVerb": "filter",
              "prioritizeVerb": "prioritize",
              "weight": 5,
              "enableHttps": false,
              "nodeCacheCapable": false
            }
          ]
        }

    Other factors that affect scheduling

    • if a Node's Condition is MemoryPressure, new BestEffort Pods (those that set neither resource requests nor limits) are not scheduled onto that Node
    • if a Node's Condition is DiskPressure, no new Pods are scheduled onto that Node
    • to keep Critical Pods running, they are automatically rescheduled when they become unhealthy; a Critical Pod (see the snippet after this list) is one whose
      • annotations include scheduler.alpha.kubernetes.io/critical-pod=''
      • tolerations include [{"key":"CriticalAddonsOnly", "operator":"Exists"}]
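
    A critical Pod therefore carries metadata along these lines (a sketch combining the two markers above):

        metadata:
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ''
        spec:
          tolerations:
          - key: "CriticalAddonsOnly"
            operator: "Exists"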
    A typical kube-scheduler startup command looks like this:

        kube-scheduler --address=127.0.0.1 --leader-elect=true --kubeconfig=/etc/kubernetes/scheduler.conf
