Troubleshooting

    This guide assumes that you have read the Concepts documentation, which explains all the components and concepts.

    We use GitHub issues to maintain a list of frequently asked questions (FAQ). You can also check there to see if your question is already addressed.

    An initial overview of Cilium can be retrieved by listing all pods to verify whether all pods have the status Running:
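
    $ kubectl -n kube-system get pods -l k8s-app=cilium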

    If Cilium encounters a problem that it cannot recover from, it will automatically report the failure state via cilium status, which is regularly queried by the Kubernetes liveness probe to automatically restart Cilium pods. If a Cilium pod is in the state CrashLoopBackOff, this indicates a permanent failure scenario.

    Detailed Status

    If a particular Cilium pod is not in a running state, the status and health of the agent on that node can be retrieved by running cilium status in the context of that pod:

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status
    KVStore: Ok etcd: 1/1 connected: http://demo-etcd-lab--a.etcd.tgraf.test1.lab.corp.isovalent.link:2379 - 3.2.5 (Leader)
    ContainerRuntime: Ok docker daemon: OK
    Kubernetes: Ok OK
    Kubernetes APIs: ["cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node", "CustomResourceDefinition"]
    Cilium: Ok OK
    NodeMonitor: Disabled
    Cilium health daemon: Ok
    Controller Status: 14/14 healthy
    Proxy Status: OK, ip 10.2.0.172, port-range 10000-20000
    Cluster health: 4/4 reachable (2018-06-16T09:49:58Z)

    Alternatively, the k8s-cilium-exec.sh script can be used to run cilium status on all nodes, providing detailed status and health information for every node in the cluster. First, download the script:

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
    $ chmod +x ./k8s-cilium-exec.sh

    … and run cilium status on all nodes:

    $ ./k8s-cilium-exec.sh cilium status
    KVStore: Ok Etcd: http://127.0.0.1:2379 - (Leader) 3.1.10
    ContainerRuntime: Ok
    Kubernetes: Ok OK
    Kubernetes APIs: ["extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint"]
    Cilium: Ok OK
    NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon: Ok
    Controller Status: 7/7 healthy
    Proxy Status: OK, ip 10.15.28.238, 0 redirects, port-range 10000-20000
    Cluster health: 1/1 reachable (2018-02-27T00:24:34Z)

    Detailed information about the status of Cilium can be inspected with the cilium status --verbose command. Verbose output includes detailed IPAM state (allocated addresses), Cilium controller status, and details of the Proxy status.
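
    For example, against the pod used above:

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status --verbose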

    Logs

    To retrieve log files of a Cilium pod, run the following (replace cilium-1234 with a pod name returned by kubectl -n kube-system get pods -l k8s-app=cilium):

    $ kubectl -n kube-system logs --timestamps cilium-1234

    If the Cilium pod was already restarted by the liveness probe after encountering an issue, it can be useful to retrieve the logs of the pod before the last restart:

    $ kubectl -n kube-system logs --timestamps -p cilium-1234

    Generic

    When logged into a host running Cilium, the cilium CLI can be invoked directly, e.g.:

    $ cilium status
    KVStore: Ok etcd: 1/1 connected: https://192.168.33.11:2379 - 3.2.7 (Leader)
    ContainerRuntime: Ok
    Kubernetes: Ok OK
    Kubernetes APIs: ["core/v1::Endpoint", "extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
    Cilium: Ok OK
    NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon: Ok
    IPv4 address pool: 261/65535 allocated
    IPv6 address pool: 4/4294967295 allocated
    Controller Status: 20/20 healthy
    Proxy Status: OK, ip 10.0.28.238, port-range 10000-20000
    Hubble: Ok Current/Max Flows: 2542/4096 (62.06%), Flows/s: 164.21 Metrics: Disabled
    Cluster health: 2/2 reachable (2018-04-11T15:41:01Z)

    Observing Flows with Hubble

    Hubble is a built-in observability tool which allows you to inspect recent flow events on all endpoints managed by Cilium. It needs to be enabled via the Helm value global.hubble.enabled=true or the --enable-hubble option on cilium-agent.
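
    For example, for a Helm-managed installation, a minimal sketch could look like this (the cilium release and cilium/cilium chart names are assumptions; adjust them to your installation):

    $ helm upgrade cilium cilium/cilium --namespace kube-system \
        --reuse-values --set global.hubble.enabled=true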

    Observing flows of a specific pod

    In order to observe the traffic of a specific pod, you will first have to retrieve the name of the Cilium pod managing it. The Hubble CLI is part of the Cilium container image and can be accessed via kubectl exec. The following query, for example, will show all events related to flows which either originated or terminated in the default/tiefighter pod in the last three minutes:

    $ kubectl exec -n kube-system cilium-77lk6 -- hubble observe --since 3m --pod default/tiefighter
    Jun 2 11:14:46.041 default/tiefighter:38314 kube-system/coredns-66bff467f8-ktk8c:53 to-endpoint FORWARDED UDP
    Jun 2 11:14:46.041 kube-system/coredns-66bff467f8-ktk8c:53 default/tiefighter:38314 to-endpoint FORWARDED UDP
    Jun 2 11:14:46.041 default/tiefighter:38314 kube-system/coredns-66bff467f8-ktk8c:53 to-endpoint FORWARDED UDP
    Jun 2 11:14:46.042 kube-system/coredns-66bff467f8-ktk8c:53 default/tiefighter:38314 to-endpoint FORWARDED UDP
    Jun 2 11:14:46.042 default/tiefighter:57746 default/deathstar-5b7489bc84-9bftc:80 L3-L4 FORWARDED TCP Flags: SYN
    Jun 2 11:14:46.042 default/deathstar-5b7489bc84-9bftc:80 default/tiefighter:57746 to-endpoint FORWARDED TCP Flags: SYN, ACK
    Jun 2 11:14:46.042 default/tiefighter:57746 default/deathstar-5b7489bc84-9bftc:80 to-endpoint FORWARDED TCP Flags: ACK
    Jun 2 11:14:46.043 default/tiefighter:57746 default/deathstar-5b7489bc84-9bftc:80 to-endpoint FORWARDED TCP Flags: ACK, PSH
    Jun 2 11:14:46.043 default/deathstar-5b7489bc84-9bftc:80 default/tiefighter:57746 to-endpoint FORWARDED TCP Flags: ACK, PSH
    Jun 2 11:14:46.043 default/tiefighter:57746 default/deathstar-5b7489bc84-9bftc:80 to-endpoint FORWARDED TCP Flags: ACK, FIN
    Jun 2 11:14:46.048 default/deathstar-5b7489bc84-9bftc:80 default/tiefighter:57746 to-endpoint FORWARDED TCP Flags: ACK, FIN
    Jun 2 11:14:46.048 default/tiefighter:57746 default/deathstar-5b7489bc84-9bftc:80 to-endpoint FORWARDED TCP Flags: ACK

    You may also use -o json to obtain more detailed information about each flow event.
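
    For example, using the same filter as above:

    $ kubectl exec -n kube-system cilium-77lk6 -- hubble observe --since 3m --pod default/tiefighter -o json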

    In the following example, the first command extracts the numeric security identities for all dropped flows which originated in the default/xwing pod in the last three minutes. The numeric security identity can then be used together with the Cilium CLI to obtain more information about why the flow was dropped:

    $ kubectl exec -n kube-system cilium-77lk6 -- \
        hubble observe --since 3m --type drop --from-pod default/xwing -o json | \
        jq .destination.identity | sort -u
    788
    $ kubectl exec -n kube-system cilium-77lk6 -- \
        cilium policy trace --src-k8s-pod default:xwing --dst-identity 788
    ----------------------------------------------------------------
    Tracing From: [k8s:class=xwing, k8s:io.cilium.k8s.policy.cluster=default, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=alliance] => To: [k8s:class=deathstar, k8s:io.cilium.k8s.policy.cluster=default, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=empire] Ports: [0/ANY]
    Resolving ingress policy for [k8s:class=deathstar k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=empire]
    * Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
    Allows from labels {"matchLabels":{"any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}
    No label match for [k8s:class=xwing k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance]
    1/1 rules selected
    Found no allow rule
    Ingress verdict: denied
    Final verdict: DENIED

    Please refer to the policy troubleshooting guide for more detail about how to troubleshoot policy related drops.

    Note

    Hubble Relay (beta) allows you to query multiple Hubble instances simultaneously without having to first manually target a specific node. See Observing flows with Hubble Relay for more information.

    Ensure Hubble is running correctly

    To ensure the Hubble client can connect to the Hubble server running inside Cilium, you may use the hubble status command:

    $ hubble status
    Healthcheck (via unix:///var/run/cilium/hubble.sock): Ok
    Max Flows: 4096
    Current Flows: 2542 (62.06%)

    cilium-agent must be running with the --enable-hubble option in order for the Hubble server to be enabled. When deploying Cilium with Helm, make sure to set the global.hubble.enabled=true value.

    To check if Hubble is enabled in your deployment, you may look for the following output in cilium status:

    $ cilium status
    ...
    Hubble: Ok Current/Max Flows: 2542/4096 (62.06%), Flows/s: 164.21 Metrics: Disabled
    ...

    Note

    Pods need to be managed by Cilium in order to be observable by Hubble. See Ensure pod is managed by Cilium for more details.

    Observing flows with Hubble Relay

    Note

    Hubble Relay is beta software and as such is not yet considered production ready.

    Hubble Relay is a service which allows you to query multiple Hubble instances simultaneously and aggregate the results. As Hubble Relay relies on individual Hubble instances, Hubble needs to be enabled when deploying Cilium. In addition, the Hubble service needs to be exposed on TCP port 4244. This can be done via the Helm values global.hubble.enabled=true and global.hubble.listenAddress=":4244" or the --enable-hubble --hubble-listen-address :4244 options on cilium-agent.
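
    For example, for a Helm-managed installation (release and chart names are assumptions; adjust to your setup):

    $ helm upgrade cilium cilium/cilium --namespace kube-system \
        --reuse-values \
        --set global.hubble.enabled=true \
        --set global.hubble.listenAddress=":4244"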

    Note

    Enabling Hubble to listen on TCP port 4244 globally has security implications as the service can be accessed without any restriction.

    Hubble Relay can be deployed using Helm by setting global.hubble.relay.enabled=true. This will deploy Hubble Relay with one replica by default. Once the Hubble Relay pod is running, you may access the service by port-forwarding it:

    $ kubectl -n kube-system port-forward service/hubble-relay 4245:80

    This will forward the Hubble Relay service port (80) to your local machine on port 4245. The next step is to download the latest binary release of the Hubble CLI from the Hubble release page. Make sure to download the tarball for your platform, verify the checksum, and extract the hubble binary from the tarball. Optionally, add the binary to your $PATH if using Linux or macOS, or to your %PATH% if using Windows.
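
    A hedged sketch for Linux on amd64 (the version and asset names below are placeholders; take the actual file names from the release page):

    $ export HUBBLE_VERSION=v0.6.1   # placeholder, use the latest release
    $ curl -LO https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz
    $ curl -LO https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz.sha256sum
    $ sha256sum --check hubble-linux-amd64.tar.gz.sha256sum
    $ tar xzvf hubble-linux-amd64.tar.gz
    $ sudo mv hubble /usr/local/bin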

    You can verify that Hubble Relay can be reached by pointing the Hubble CLI at the forwarded port from your local machine, for example:
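
    $ hubble --server localhost:4245 status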

    This command should return an output similar to the following:

    Healthcheck (via localhost:4245): Ok
    Max Flows: 16384
    Current Flows: 16384 (100.00%)

    For convenience, you may set and export the HUBBLE_DEFAULT_SOCKET_PATH environment variable:

    $ export HUBBLE_DEFAULT_SOCKET_PATH=localhost:4245

    This will allow you to use the hubble status and hubble observe commands without having to specify the server address via the --server flag.

    As Hubble Relay shares the same API as individual Hubble instances, you may follow the Observing flows with Hubble section keeping in mind that limitations with regards to what can be seen from individual Hubble instances no longer apply.

    Cilium connectivity tests

    The Cilium connectivity test deploys a series of services, deployments, and CiliumNetworkPolicy resources which use various connectivity paths to connect to each other. Connectivity paths include paths with and without service load-balancing, as well as various network policy combinations.

    Note

    The connectivity tests will only work in a namespace with no other pods or network policies applied. If a Cilium Clusterwide Network Policy is enabled, it may also break this connectivity check.

    To run the connectivity tests, create an isolated test namespace called cilium-test and deploy the tests into it:

    $ kubectl create ns cilium-test
    $ kubectl apply --namespace=cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/connectivity-check/connectivity-check.yaml

    The tests cover various functionality of the system. If a test passes, it suggests that the referenced subsystem is functioning correctly.

    The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

    $ kubectl get pods -n cilium-test
    NAME READY STATUS RESTARTS AGE
    echo-a-6788c799fd-42qxx 1/1 Running 0 69s
    echo-b-59757679d4-pjtdl 1/1 Running 0 69s
    echo-b-host-f86bd784d-wnh4v 1/1 Running 0 68s
    host-to-b-multi-node-clusterip-585db65b4d-x74nz 1/1 Running 0 68s
    host-to-b-multi-node-headless-77c64bc7d8-kgf8p 1/1 Running 0 67s
    pod-to-a-allowed-cnp-87b5895c8-bfw4x 1/1 Running 0 68s
    pod-to-a-b76ddb6b4-2v4kb 1/1 Running 0 68s
    pod-to-a-denied-cnp-677d9f567b-kkjp4 1/1 Running 0 68s
    pod-to-b-intra-node-nodeport-8484fb6d89-bwj8q 1/1 Running 0 68s
    pod-to-b-multi-node-clusterip-f7655dbc8-h5bwk 1/1 Running 0 68s
    pod-to-b-multi-node-headless-5fd98b9648-5bjj8 1/1 Running 0 68s
    pod-to-b-multi-node-nodeport-74bd8d7bd5-kmfmm 1/1 Running 0 68s
    pod-to-external-1111-7489c7c46d-jhtkr 1/1 Running 0 68s
    pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-97p75 1/1 Running 0 68s

    Information about test failures can be determined by describing a failed test pod:

    $ kubectl describe pod pod-to-b-intra-node-hostport
    Warning Unhealthy 6s (x6 over 56s) kubelet, agent1 Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused
    Warning Unhealthy 2s (x3 over 52s) kubelet, agent1 Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused

    Checking cluster connectivity health

    By default when Cilium is run, it launches instances of cilium-health in the background to determine the overall connectivity status of the cluster. This tool periodically runs bidirectional traffic across multiple paths through the cluster and through each node using different protocols to determine the health status of each path and protocol. At any point in time, cilium-health may be queried for the connectivity status of the last probe.

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-health status
    Probe time: 2018-06-16T09:51:58Z
    Nodes:
      ip-172-0-52-116.us-west-2.compute.internal (localhost):
        Host connectivity to 172.0.52.116:
          ICMP to stack: OK, RTT=315.254µs
          HTTP to agent: OK, RTT=368.579µs
        Endpoint connectivity to 10.2.0.183:
          ICMP to stack: OK, RTT=190.658µs
          HTTP to agent: OK, RTT=536.665µs
      ip-172-0-117-198.us-west-2.compute.internal:
        Host connectivity to 172.0.117.198:
          ICMP to stack: OK, RTT=1.009679ms
          HTTP to agent: OK, RTT=1.808628ms
        Endpoint connectivity to 10.2.1.234:
          ICMP to stack: OK, RTT=1.016365ms
          HTTP to agent: OK, RTT=2.29877ms

    For each node, the connectivity will be displayed for each protocol and path, both to the node itself and to an endpoint on that node. The latency specified is a snapshot at the last time a probe was run, which is typically once per minute. The ICMP connectivity row represents Layer 3 connectivity to the networking stack, while the HTTP connectivity row represents connection to an instance of the cilium-health agent running on the host or as an endpoint.

    Sometimes you may experience broken connectivity, which may be due to a number of different causes. A common cause is unwanted packet drops at the network level. The tool cilium monitor allows you to quickly inspect and see if and where packet drops happen. Following is an example output (use kubectl exec as in previous examples if running with Kubernetes):

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium monitor --type drop
    Listening for events on 2 CPUs with 64x4096 of shared memory
    Press Ctrl-C to quit
    xx drop (Policy denied) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
    xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Invalid destination mac) to endpoint 0, identity 0->0: fe80::5c25:ddff:fe8e:78d8 -> ff02::2 RouterSolicitation

    The above indicates that a packet to endpoint ID 25729 has been dropped due to violation of the Layer 3 policy.

    Handling drop (CT: Map insertion failed)

    If connectivity fails and cilium monitor --type drop shows xx drop (CT: Map insertion failed), then it is likely that the connection tracking table is filling up and the automatic adjustment of the garbage collector interval is insufficient. Set --conntrack-gc-interval to an interval lower than the default. Alternatively, the values of bpf-ct-global-any-max and bpf-ct-global-tcp-max can be increased. These options are a trade-off: lowering conntrack-gc-interval consumes more CPU, while raising bpf-ct-global-any-max and bpf-ct-global-tcp-max increases the amount of memory consumed.
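
    A hedged example of adjusting these settings via the cilium-config ConfigMap (the key names mirror the option names above, the values are purely illustrative, and the Cilium pods need to be restarted for new map sizes to take effect):

    $ kubectl -n kube-system patch configmap cilium-config --type merge -p \
        '{"data":{"conntrack-gc-interval":"2m","bpf-ct-global-tcp-max":"1000000","bpf-ct-global-any-max":"500000"}}'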

    Enabling datapath debug messages

    By default, datapath debug messages are disabled, and therefore not shown in cilium monitor -v output. To enable them, add "datapath" to the debug-verbose option.
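
    A hedged sketch of doing this through the cilium-config ConfigMap (the key name mirrors the option above and is an assumption; the Cilium DaemonSet, assumed here to be named cilium, must be restarted to pick it up):

    $ kubectl -n kube-system patch configmap cilium-config --type merge -p \
        '{"data":{"debug-verbose":"datapath"}}'
    $ kubectl -n kube-system rollout restart daemonset/cilium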

    Policy Troubleshooting

    Ensure pod is managed by Cilium

    A potential cause for policy enforcement not functioning as expected is that the networking of the pod selected by the policy is not being managed by Cilium. The following situations result in unmanaged pods:

    • The pod is running in host networking and will use the host’s IP address directly. Such pods have full network connectivity but Cilium will not provide security policy enforcement for such pods.
    • The pod was started before Cilium was deployed. Cilium only manages pods that have been deployed after Cilium itself was started. Cilium will not provide security policy enforcement for such pods.

    If pod networking is not managed by Cilium, ingress and egress policy rules selecting the respective pods will not be applied. See the network policy documentation for more details.

    You can run the following script to list the pods which are not managed by Cilium:

    $ ./contrib/k8s/k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

    See section Policy Tracing for details and examples on how to use the policy tracing feature.

    Understand the rendering of your policy

    There are always multiple ways to approach a problem. Cilium can provide the rendering of the aggregate policy provided to it, leaving you to simply compare with what you expect the policy to actually be rather than search (and potentially overlook) every policy. At the expense of reading a very large dump of an endpoint, this is often a faster path to discovering errant policy requests in the Kubernetes API.

    Start by finding the endpoint you are debugging from the following list. There are several cross references for you to use in this list, including the IP address and pod labels:

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint list

    When you find the correct endpoint, the first column of every row is the endpoint ID. Use that to dump the full endpoint information:

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint get 59084

    Importing this dump into a JSON-friendly editor can help browse and navigate the information here. At the top level of the dump, there are two nodes of note:

    • spec: The desired state of the endpoint
    • status: The current state of the endpoint

    This is the standard Kubernetes control loop pattern. Cilium is the controller here, and it is iteratively working to bring the status in line with the spec.

    Opening the status, we can drill down through policy.realized.l4. Do your ingress and egress rules match what you expect? If not, the reference to the errant rules can be found in the derived-from-rules node.
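
    As a shortcut, assuming jq is available and that cilium endpoint get returns the endpoint as a JSON array, you can extract just that subtree:

    $ kubectl -n kube-system exec cilium-q8wvt -- cilium endpoint get 59084 \
        | jq '.[0].status.policy.realized.l4'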

    etcd (kvstore)

    Introduction

    Cilium can be operated in CRD mode and in kvstore/etcd mode. When Cilium is running in kvstore/etcd mode, the kvstore becomes a vital component of overall cluster health, as it is required to be available for several operations.

    Operations for which the kvstore is strictly required when running in etcd mode:

    Scheduling of new workloads:

    As part of scheduling workloads/endpoints, agents will perform security identity allocation, which requires interaction with the kvstore. If a workload can be scheduled due to re-using a known security identity, state propagation of the endpoint details to other nodes will still depend on the kvstore, and thus policy drops may be observed as other nodes in the cluster will not be aware of the new workload.

    Multi cluster:

    All state propagation between clusters depends on the kvstore.

    Node discovery:

    New nodes need to register themselves in the kvstore.

    Agent bootstrap:

    The Cilium agent will eventually fail if it can’t connect to the kvstore at bootstrap time, however, the agent will still perform all possible operations while waiting for the kvstore to appear.

    Operations which do not require kvstore availability:

    All datapath operations:

    All datapath forwarding, policy enforcement and visibility functions for existing workloads/endpoints do not depend on the kvstore. Packets will continue to be forwarded and network policy rules will continue to be enforced.

    However, if the agent needs to restart as part of the Recovery behavior, there can be delays in:

    • processing of flow events and metrics
    • short unavailability of layer 7 proxies

    NetworkPolicy updates:

    Network policy updates will continue to be processed and applied.

    Services updates:

    All updates to services will be processed and applied.

    Understanding etcd status

    The etcd status is reported when running cilium status. The following line represents the status of etcd:

    KVStore: Ok etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=true: https://192.168.33.11:2379 - 3.4.9 (Leader)

    OK:

    The overall status. Either OK or Failure.

    1/1 connected:

    Number of total etcd endpoints and how many of them are reachable.

    lease-ID:

    UUID of the lease used for all keys owned by this agent.

    lock lease-ID:

    UUID of the lease used for locks acquired by this agent.

    has-quorum:

    Status of etcd quorum. Either true or set to an error.

    consecutive-errors:

    Number of consecutive quorum errors. Only printed if errors are present.

    https://192.168.33.11:2379 - 3.4.9 (Leader):

    List of all etcd endpoints stating the etcd version and whether the particular endpoint is currently the elected leader. If an etcd endpoint cannot be reached, the error is shown.

    Recovery behavior

    In the event of an etcd endpoint becoming unhealthy, etcd should automatically resolve this by electing a new leader and by failing over to a healthy etcd endpoint. As long as quorum is preserved, the etcd cluster will remain functional.

    In addition, Cilium performs a background check at a regular interval to determine etcd health and potentially take action. The interval depends on the overall cluster size: the larger the cluster, the longer the interval.

    Example of a status with a quorum failure which has not yet reached the threshold:

    Example of a status with the number of quorum failures exceeding the threshold:

    KVStore: Failure Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received

    Symptom

    Endpoint to endpoint communication on a single node succeeds but communication fails between endpoints across multiple nodes.

    Troubleshooting steps:

    1. Run cilium-health status on the node of the source and destination endpoint. It should describe the connectivity from that node to other nodes in the cluster, and to a simulated endpoint on each other node. Identify points in the cluster that cannot talk to each other. If the command does not describe the status of the other node, there may be an issue with the KV-Store.
    2. Run cilium monitor on the node of the source and destination endpoint. Look for packet drops.

    When running in tunneling mode:

    1. Run cilium bpf tunnel list and verify that each Cilium node is aware of the other nodes in the cluster. If not, check the logfile for errors.

    2. If nodes are being populated correctly, run tcpdump -n -i cilium_vxlan on each node to verify whether cross node traffic is being forwarded correctly between nodes.

      If packets are being dropped,

    • verify that the node IPs listed in cilium bpf tunnel list can reach each other.
      • verify that the firewall on each node allows UDP port 8472.

    When running in Native-Routing mode:

    1. Run ip route or check your cloud provider router and verify that you have routes installed to route the endpoint prefix between all nodes.
    2. Verify that the firewall on each node permits routing of the endpoint IPs.

    Useful Scripts

    Retrieve Cilium pod managing a particular pod

    Identifies the Cilium pod that is managing a particular pod in a namespace:

    k8s-get-cilium-pod.sh <pod> <namespace>

    Example:

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-get-cilium-pod.sh
    $ ./k8s-get-cilium-pod.sh luke-pod default
    cilium-zmjj9

    Execute a command in all Kubernetes Cilium pods

    Run a command within all Cilium pods of a cluster

    k8s-cilium-exec.sh <command>

    Example:

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
    $ ./k8s-cilium-exec.sh uptime
    10:15:16 up 6 days, 7:37, 0 users, load average: 0.00, 0.02, 0.00
    10:15:16 up 6 days, 7:32, 0 users, load average: 0.00, 0.03, 0.04
    10:15:16 up 6 days, 7:30, 0 users, load average: 0.75, 0.27, 0.15
    10:15:16 up 6 days, 7:28, 0 users, load average: 0.14, 0.04, 0.01

    List unmanaged Kubernetes pods

    Lists all Kubernetes pods in the cluster for which Cilium does not provide networking. This includes pods running in host-networking mode and pods that were started before Cilium was deployed.

    k8s-unmanaged.sh

    Example:

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-unmanaged.sh
    $ ./k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

    Reporting a problem

    Automatic log & state collection

    Before you report a problem, make sure to retrieve the necessary information from your cluster before the failure state is lost. Cilium provides a script to automatically grab logs and retrieve debug information from all Cilium pods in the cluster.

    The script has the following list of prerequisites:

    • Requires Python >= 2.7.*
    • Requires kubectl.
    • kubectl should be pointing to your cluster before running the tool.

    You can download the latest version of the cilium-sysdump tool using the following command:

    curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip
    python cilium-sysdump.zip

    You can specify from which nodes to collect the system dumps by passing node IP addresses via the --nodes argument:

    python cilium-sysdump.zip --nodes=$NODE1_IP,$NODE2_IP

    Use --help to see more options:

    python cilium-sysdump.zip --help

    Single Node Bugtool

    If you are not running Kubernetes, it is also possible to run the bug collection tool manually with the scope of a single node:

    The cilium-bugtool captures potentially useful information about your environment for debugging. The tool is meant to be used for debugging a single Cilium agent node. In the Kubernetes case, if you have multiple Cilium pods, the tool can retrieve debugging information from all of them. The tool works by archiving a collection of command output and files from several places. By default, it writes to the tmp directory.

    Note that the command needs to be run from inside the Cilium pod/container.

    $ cilium-bugtool

    When running it with no options as shown above, it will try to copy various files and execute some commands. If kubectl is detected, it will search for Cilium pods. The default label is k8s-app=cilium, but this and the namespace can be changed via the k8s-namespace and k8s-label options, respectively.

    If you want to capture the archive from a Kubernetes pod, then the process is a bit different.
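
    A hedged sketch of that process (the pod name and archive file name are illustrative; use the path printed by cilium-bugtool):

    $ kubectl -n kube-system exec cilium-1234 -- cilium-bugtool
    $ kubectl cp kube-system/cilium-1234:/tmp/cilium-bugtool-XXXX.tar ./cilium-bugtool.tar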

    Note

    Please check the archive for sensitive information and strip it away before sharing it with us.

    Below is an approximate list of the kind of information in the archive.

    • Cilium status
    • Cilium version
    • Kernel configuration
    • Resolve configuration
    • Cilium endpoint state
    • Cilium logs
    • Docker logs
    • dmesg
    • ethtool
    • ip a
    • ip link
    • ip r
    • iptables-save
    • kubectl -n kube-system get pods
    • kubectl get pods,svc for all namespaces
    • uname
    • uptime
    • cilium bpf * list
    • cilium endpoint get for each endpoint
    • cilium endpoint list
    • hostname
    • cilium policy get
    • cilium service list

    Debugging information

    If you are not running Kubernetes, you can use the cilium debuginfo command to retrieve useful debugging information. If you are running Kubernetes, this command is automatically run as part of the system dump.

    cilium debuginfo can print useful output from the Cilium API. The output is in Markdown format, so it can be used when reporting a bug on the GitHub issue tracker. Running it without arguments will print to standard output, but you can also redirect the output to a file:

    $ cilium debuginfo -f debuginfo.md

    Note

    Please check the debuginfo file for sensitive information and strip it away before sharing it with us.

    Slack Assistance

    The Cilium Slack community is a helpful first point of assistance for troubleshooting a problem or discussing options on how to address it.

    The Slack community is open to everyone. You can request an invite email by visiting the Cilium Slack invite page.