kube-scheduler

调度器需要充分考虑诸多的因素：

公平调度
资源高效利用
QoS
affinity 和 anti-affinity
数据本地化（data locality）
内部负载干扰（inter-workload interference）
deadlines

有三种方式指定 Pod 只运行在指定的 Node 节点上

nodeSelector：只调度到匹配指定 label 的 Node 上
nodeAffinity：功能更丰富的 Node 选择器，比如支持集合操作
podAffinity：调度到满足条件的 Pod 所在的 Node 上

首先给 Node 打上标签

然后在 daemonset 中指定 nodeSelector 为 disktype=ssd：

spec:
  nodeSelector:
    disktype: ssd

nodeAffinity 目前支持两种：requiredDuringSchedulingIgnoredDuringExecution 和 preferredDuringSchedulingIgnoredDuringExecution，分别代表必须满足条件和优选条件。比如下面的例子代表调度到包含标签 kubernetes.io/e2e-az-name 并且值为 e2e-az1 或 e2e-az2 的 Node 上，并且优选还带有标签 another-node-label-key=another-node-label-value 的 Node。

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0

podAffinity 基于 Pod 的标签来选择 Node，仅调度到满足条件 Pod 所在的 Node 上，支持 podAffinity 和 podAntiAffinity。这个功能比较绕，以下面的例子为例：

如果一个 “Node 所在 Zone 中包含至少一个带有 security=S1 标签且运行中的 Pod”，那么可以调度到该 Node
不调度到 “包含至少一个带有 security=S2 标签且运行中 Pod” 的 Node 上

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: gcr.io/google_containers/pause:2.0

Taints 和 tolerations

Taints 和 tolerations 用于保证 Pod 不被调度到不合适的 Node 上，其中 Taint 应用于 Node 上，而 toleration 则应用于 Pod 上。

目前支持的 taint 类型

NoSchedule：新的 Pod 不调度到该 Node 上，不影响正在运行的 Pod
PreferNoSchedule：soft 版的 NoSchedule，尽量不调度到该 Node 上
NoExecute：新的 Pod 不调度到该 Node 上，并且删除（evict）已在运行的 Pod。Pod 可以增加一个时间（tolerationSeconds），

比如，假设 node1 上应用以下几个 taint

下面的这个 Pod 由于没有 toleratekey2=value2:NoSchedule 无法调度到 node1 上

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"

而正在运行且带有 tolerationSeconds 的 Pod 则会在 600s 之后删除

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 600
- key: "key2"
  operator: "Equal"
  value: "value2"
  effect: "NoSchedule"

注意，DaemonSet 创建的 Pod 会自动加上对 node.alpha.kubernetes.io/unreachable 和 node.alpha.kubernetes.io/notReady 的 NoExecute Toleration，以避免它们因此被删除。

优先级调度

从 v1.8 开始，kube-scheduler 支持定义 Pod 的优先级，从而保证高优先级的 Pod 优先调度。并从 v1.11 开始默认开启。

在指定 Pod 的优先级之前需要先定义一个 PriorityClass（非 namespace 资源），如

apiVersion: v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

其中

为 32 位整数的优先级，该值越大，优先级越高
globalDefault 用于未配置 PriorityClassName 的 Pod，整个集群中应该只有一个 PriorityClass 将其设置为 true

如果默认的调度器不满足要求，还可以部署自定义的调度器。并且，在整个集群中还可以同时运行多个调度器实例，通过 podSpec.schedulerName 来选择使用哪一个调度器（默认使用内置的调度器）。

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # 选择使用自定义调度器 my-scheduler
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: nginx:1.10

调度器的示例参见这里。

调度器扩展

kube-scheduler 还支持使用 --policy-config-file 指定一个调度策略文件来自定义调度策略，比如

"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
    ],
"priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
    ],
"extenders":[
    {
        "urlPrefix": "http://127.0.0.1:12346/scheduler",
        "apiVersion": "v1beta1",
        "filterVerb": "filter",
        "prioritizeVerb": "prioritize",
        "weight": 5,
        "enableHttps": false,
        "nodeCacheCapable": false
    }
    ]
}

其他影响调度的因素

如果 Node Condition 处于 MemoryPressure，则所有 BestEffort 的新 Pod（未指定 resources limits 和 requests）不会调度到该 Node 上
如果 Node Condition 处于 DiskPressure，则所有新 Pod 都不会调度到该 Node 上
为了保证 Critical Pods 的正常运行，当它们处于异常状态时会自动重新调度。Critical Pods 是指
- annotation 包括 scheduler.alpha.kubernetes.io/critical-pod=''
- tolerations 包括 [{"key":"CriticalAddonsOnly", "operator":"Exists"}]
- priorityClass 为 system-cluster-critical 或者 system-node-critical

kube-scheduler --address=127.0.0.1 --leader-elect=true --kubeconfig=/etc/kubernetes/scheduler.conf

kube-scheduler 工作原理

kube-scheduler 调度原理：

kube-scheduler 调度分为两个阶段，predicate 和 priority

predicate：过滤不符合条件的节点
priority：优先级排序，选择优先级最高的节点

predicates 策略

PodFitsPorts：同 PodFitsHostPorts
PodFitsHostPorts：检查是否有 Host Ports 冲突
PodFitsResources：检查 Node 的资源是否充足，包括允许的 Pod 数量、CPU、内存、GPU 个数以及其他的 OpaqueIntResources
HostName：检查 pod.Spec.NodeName 是否与候选节点一致
MatchNodeSelector：检查候选节点的是否匹配
NoVolumeZoneConflict：检查 volume zone 是否冲突
MaxEBSVolumeCount：检查 AWS EBS Volume 数量是否过多（默认不超过 39）
MaxGCEPDVolumeCount：检查 GCE PD Volume 数量是否过多（默认不超过 16）
MaxAzureDiskVolumeCount：检查 Azure Disk Volume 数量是否过多（默认不超过 16）
MatchInterPodAffinity：检查是否匹配 Pod 的亲和性要求
NoDiskConflict：检查是否存在 Volume 冲突，仅限于 GCE PD、AWS EBS、Ceph RBD 以及 ISCSI
GeneralPredicates：分为 noncriticalPredicates 和 EssentialPredicates。noncriticalPredicates 中包含 PodFitsResources，EssentialPredicates 中包含 PodFitsHost，PodFitsHostPorts 和 PodSelectorMatches。
PodToleratesNodeTaints：检查 Pod 是否容忍 Node Taints
CheckNodeMemoryPressure：检查 Pod 是否可以调度到 MemoryPressure 的节点上
CheckNodeDiskPressure：检查 Pod 是否可以调度到 DiskPressure 的节点上
NoVolumeNodeConflict：检查节点是否满足 Pod 所引用的 Volume 的条件

priorities 策略

SelectorSpreadPriority：优先减少节点上属于同一个 Service 或 Replication Controller 的 Pod 数量
InterPodAffinityPriority：优先将 Pod 调度到相同的拓扑上（如同一个节点、Rack、Zone 等）
LeastRequestedPriority：优先调度到请求资源少的节点上
BalancedResourceAllocation：优先平衡各节点的资源使用
NodePreferAvoidPodsPriority：alpha.kubernetes.io/preferAvoidPods 字段判断, 权重为 10000，避免其他优先级策略的影响
NodeAffinityPriority：优先调度到匹配 NodeAffinity 的节点上
TaintTolerationPriority：优先调度到匹配 TaintToleration 的节点上
ServiceSpreadingPriority：尽量将同一个 service 的 Pod 分布到不同节点上，已经被 SelectorSpreadPriority 替代 [默认未使用]
EqualPriority：将所有节点的优先级设置为 1[默认未使用]
ImageLocalityPriority：尽量将使用大镜像的容器调度到已经下拉了该镜像的节点上 [默认未使用]
MostRequestedPriority：尽量调度到已经使用过的 Node 上，特别适用于 cluster-autoscaler[默认未使用]

代码入口路径

参考文档

Configure Multiple Schedulers