Troubleshooting

This section provides information on diagnostics and troubleshooting.

If you experience issues with the CEX device plug-in, you can check the pod status, gather pod diagnostics, and collect debugging data.

Prerequisites

You must be logged in as a user that belongs to a role with administrative privileges for the cluster, for example, system:admin or kube:admin.

Verification

The CEX device plug-in runs as a daemonset in the cex-device-plugin namespace.

The following query should list the CEX device plug-in daemonset:

$ kubectl get daemonsets -n cex-device-plugin
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
cex-plugin-daemonset   3         3         3       3            3           <none>          4d
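
The comparison of the DESIRED and READY counts can also be scripted. The following is a minimal sketch, not part of the plug-in: the function name daemonsets_not_ready is hypothetical, and it assumes the column order shown above (DESIRED in column 2, READY in column 4) with the header row suppressed by --no-headers.

```shell
# Hypothetical helper: reads `kubectl get daemonsets --no-headers` output on
# stdin and prints every daemonset whose READY count (column 4) differs from
# its DESIRED count (column 2). Prints nothing when all daemonsets are ready.
daemonsets_not_ready() {
  awk '$2 != $4 { printf "%s: %s of %s pods ready\n", $1, $4, $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get daemonsets -n cex-device-plugin --no-headers | daemonsets_not_ready
```

Empty output means that every desired pod is ready. Any printed line names a daemonset that needs a closer look.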

The daemonset runs one pod, with a single container, on each compute node.

To review the status of the pods running in the cex-device-plugin namespace, run the following command:

$ kubectl get pods -n cex-device-plugin
NAME                         READY   STATUS    RESTARTS   AGE
cex-plugin-daemonset-bfxt2   1/1     Running   0          3d23h
cex-plugin-daemonset-bhhj8   1/1     Running   0          3d23h
cex-plugin-daemonset-bntsp   1/1     Running   0          3d23h

Verify that the pods are running correctly. There should be one pod in status Running for each compute node. If one or more of the CEX device plug-in pods are missing or do not show a Running status, you can collect diagnostic information.
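
On clusters with many compute nodes, filtering the pod list can be quicker than reading it. The following is a sketch under stated assumptions: the function name pods_needing_attention is hypothetical, and it relies on the column order shown above (NAME, READY, STATUS) with the header row suppressed by --no-headers.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name, READY value, and STATUS of every pod that is not in
# status Running or that does not have all of its containers ready.
pods_needing_attention() {
  awk '{
    split($2, ready, "/")            # READY column, for example "1/1"
    if ($3 != "Running" || ready[1] != ready[2])
      print $1, $2, $3
  }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -n cex-device-plugin --no-headers | pods_needing_attention
```

Each pod that is printed is a candidate for the detailed inspection described next.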

To inspect the status of a pod in detail, use the describe subcommand, for example:

$ kubectl describe pod cex-plugin-daemonset-bfxt2 -n cex-device-plugin

When these requirements are fulfilled, ensure that a CEX resource configuration map, which defines the CEX config sets, is deployed in the cex-device-plugin namespace:

$ kubectl get configmap -n cex-device-plugin
NAME                                 DATA   AGE
...                                  ...    ...
cex-resources-config                 1      4d2h
...                                  ...    ...

To verify whether the configmap has been deployed, run kubectl describe on one of the plug-in pods. If no configmap is deployed, the output shows a message that the volume mount failed, for example:

MountVolume.SetUp failed for volume "cex-resources-conf" : configmap "cex-resources-config" not found
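
The describe output can be searched for this message automatically. The following sketch assumes that the event text has exactly the form shown in the example above. The function name missing_configmaps is hypothetical.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# the name of each configmap that a failed volume mount reported as not found.
# Duplicate events for the same configmap are collapsed by sort -u.
missing_configmaps() {
  sed -n 's/.*MountVolume\.SetUp failed for volume .* configmap "\([^"]*\)" not found.*/\1/p' | sort -u
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe pod cex-plugin-daemonset-bfxt2 -n cex-device-plugin | missing_configmaps
```

If the function prints cex-resources-config, the CEX resource configuration map is not deployed in the namespace of the pod.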

If the CEX configmap is deployed and the CEX device plug-in instances are running, verify the available and allocated CEX resources on each compute node:

$ kubectl describe nodes
...
Allocatable:
  ...
  cex.s390.ibm.com/Accel:                1
  cex.s390.ibm.com/CCA_for_customer_1:   3
  cex.s390.ibm.com/EP11_for_customer_2:  2
  cpu:                                   3500m
  ephemeral-storage:                     15562841677
  ...
...
Allocated resources:
  Resource                              Requests      Limits
  --------                              --------      ------
  cpu                                   408m (11%)    0 (0%)
  memory                                2213Mi (20%)  0 (0%)
  ephemeral-storage                     0 (0%)        0 (0%)
  hugepages-1Mi                         0 (0%)        0 (0%)
  cex.s390.ibm.com/Accel                0             0
  cex.s390.ibm.com/CCA_for_customer_1   1             1
  cex.s390.ibm.com/EP11_for_customer_2  0             0
  ...
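
Because the describe output is long, it can help to reduce it to the CEX entries. The following is a sketch, not an official tool: the function name cex_resources is hypothetical, and it assumes that each node section starts with a Name: line and might contain a Capacity: section, neither of which appears in the abbreviated listing above.

```shell
# Hypothetical helper: reads `kubectl describe nodes` output on stdin and
# prints one line per CEX resource: node, section, resource name, value.
# In the "allocated" section the value is taken from the Requests column.
cex_resources() {
  awk '
    /^Name:/                { node = $2 }
    /^Capacity:/            { section = "capacity" }
    /^Allocatable:/         { section = "allocatable" }
    /^Allocated resources:/ { section = "allocated" }
    $1 ~ /^cex\.s390\.ibm\.com\// {
      name = $1
      sub(/:$/, "", name)   # Allocatable entries end with a colon
      print node, section, name, $2
    }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe nodes | cex_resources
```

A config set that is missing from the allocatable lines of a node indicates that the plug-in instance on that node did not announce any device for it.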

Each CEX device plug-in pod writes log messages with details that might explain a failure or misbehavior. To extract the logs of each CEX device plug-in instance, run the following command sequence:

$ kubectl get pods -n cex-device-plugin --no-headers | grep cex-plugin-daemonset
cex-plugin-daemonset-p5j8h   1/1   Running   0     32m
cex-plugin-daemonset-qdz8r   1/1   Running   0     32m
cex-plugin-daemonset-zxwts   1/1   Running   0     32m
$ kubectl logs -n cex-device-plugin cex-plugin-daemonset-p5j8h
$ kubectl logs -n cex-device-plugin cex-plugin-daemonset-qdz8r
$ kubectl logs -n cex-device-plugin cex-plugin-daemonset-zxwts

The following excerpts show the important parts of a sample CEX device plug-in log, with explanations:

 1: 2022/06/07 14:05:18 Main: S390 k8s z crypto resources plugin starting
 2: 2022/06/07 14:05:18 Plugin Version: v1.0.2
 3: 2022/06/07 14:05:18 Git URL:        https://github.com/ibm-s390-cloud/k8s-cex-dev-plugin.git
 4: 2022/06/07 14:05:18 Git Commit:     40fae46c3d3aacff055d5f2fd7e1c580abc850b9

Line 2: CEX device plug-in version.

Lines 3-4: Source code repository and commit ID on which this CEX device plug-in build is based.

 5: 2022/06/07 14:05:18 Main: Machine id is 'IBM-3906-00000000000DA1E7'
 6: 2022/06/07 14:05:18 Ap: apScanAPQNs() found 4 APQNs: (6,51,cex6,accel,true), (8,51,cex6,cca,true), (9,51,cex6,cca,true), (10,51,cex6,ep11,true)
 7: 2022/06/07 14:05:18 CryptoConfig: Configuration changes detected
 8: 2022/06/07 14:05:18 CryptoConfig: Configuration successful updated
 9: 2022/06/07 14:05:18 Main: Crypto configuration successful read
10: 2022/06/07 14:05:18 CryptoConfig (3 CryptoConfigSets):
11: 2022/06/07 14:05:18   setname: 'CCA_for_customer_1'
12: 2022/06/07 14:05:18     project: 'customer_1'
13: 2022/06/07 14:05:18     5 equvialent APQNs:
14: 2022/06/07 14:05:18       APQN adapter=4 domain=51 machineid='*'
15: 2022/06/07 14:05:18       APQN adapter=8 domain=51 machineid='*'
16: 2022/06/07 14:05:18       APQN adapter=9 domain=51 machineid='*'
17: 2022/06/07 14:05:18       APQN adapter=12 domain=51 machineid='*'
18: 2022/06/07 14:05:18       APQN adapter=13 domain=51 machineid='*'
19: 2022/06/07 14:05:18   setname: 'EP11_for_customer_2'
20: 2022/06/07 14:05:18     project: 'customer_1'
21: 2022/06/07 14:05:18     3 equvialent APQNs:
22: 2022/06/07 14:05:18       APQN adapter=5 domain=51 machineid='*'
23: 2022/06/07 14:05:18       APQN adapter=10 domain=51 machineid='*'
24: 2022/06/07 14:05:18       APQN adapter=11 domain=51 machineid='*'
25: 2022/06/07 14:05:18   setname: 'Accel'
26: 2022/06/07 14:05:18     project: 'default'
27: 2022/06/07 14:05:18     3 equvialent APQNs:
28: 2022/06/07 14:05:18       APQN adapter=3 domain=51 machineid='*'
29: 2022/06/07 14:05:18       APQN adapter=6 domain=51 machineid='*'
30: 2022/06/07 14:05:18       APQN adapter=7 domain=51 machineid='*'

Line 6: The list of APQNs found by the CEX device plug-in instance on the compute node.

Lines 10-30: Condensed view of the CEX resource configuration.
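
In the sample, the config sets list more adapters than the scan in line 6 found, so it can be useful to extract the scanned APQNs and compare them with the configuration. The following sketch assumes the message format shown in line 6. The function name scanned_apqns is hypothetical. The first two fields of each tuple correspond to the adapter and domain values in lines 14-30. The remaining fields are passed through unchanged because their meaning is not described here.

```shell
# Hypothetical helper: reads a plug-in log on stdin and prints the APQNs from
# the last apScanAPQNs() message in that log, one per line, as space-separated
# fields in the order in which they appear in the log tuple.
scanned_apqns() {
  grep 'apScanAPQNs() found' | tail -n 1 |
    grep -o '([0-9]*,[0-9]*,[^)]*)' | tr -d '()' | tr ',' ' '
}

# Usage against a live cluster (requires kubectl access):
#   kubectl logs -n cex-device-plugin cex-plugin-daemonset-p5j8h | scanned_apqns
```

The sketch uses grep -o, which is available in GNU, BSD, and BusyBox grep but is not required by POSIX.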

...
40: 2022/06/07 14:05:18 PodLister: Start()
41: 2022/06/07 14:05:18 Plugin: Register plugins for these CryptoConfigSets: [Accel CCA_for_customer_1 EP11_for_customer_2]
42: 2022/06/07 14:05:18 Plugin: Announcing 'cex.s390.ibm.com' as our resource namespace
43: 2022/06/07 14:05:18 Plugin: NewPlugin('EP11_for_customer_2')
44: 2022/06/07 14:05:18 Plugin['EP11_for_customer_2']: Start()
45: 2022/06/07 14:05:18 Plugin: Announcing 'cex.s390.ibm.com' as our resource namespace
46: 2022/06/07 14:05:18 Plugin: NewPlugin('Accel')
47: 2022/06/07 14:05:18 Plugin['Accel']: Start()
48: 2022/06/07 14:05:18 Plugin: Announcing 'cex.s390.ibm.com' as our resource namespace
49: 2022/06/07 14:05:18 Plugin: NewPlugin('CCA_for_customer_1')
50: 2022/06/07 14:05:18 Plugin['CCA_for_customer_1']: Start()
51: 2022/06/07 14:05:18 Plugin['Accel']: Found 1 eligible APQNs: (6,51,cex6,accel,true)
52: 2022/06/07 14:05:18 Plugin['Accel']: Overcommit not specified in ConfigSet, fallback to 1
53: 2022/06/07 14:05:18 Plugin['Accel']: Derived 1 plugin devices from the list of APQNs
54: 2022/06/07 14:05:18 Plugin['EP11_for_customer_2']: Found 1 eligible APQNs: (10,51,cex6,ep11,true)
55: 2022/06/07 14:05:18 Plugin['EP11_for_customer_2']: Overcommit not specified in ConfigSet, fallback to 1
56: 2022/06/07 14:05:18 Plugin['EP11_for_customer_2']: Derived 1 plugin devices from the list of APQNs
57: 2022/06/07 14:05:18 Plugin['CCA_for_customer_1']: Found 2 eligible APQNs: (8,51,cex6,cca,true), (9,51,cex6,cca,true)
58: 2022/06/07 14:05:18 Plugin['CCA_for_customer_1']: Overcommit not specified in ConfigSet, fallback to 1
59: 2022/06/07 14:05:18 Plugin['CCA_for_customer_1']: Derived 2 plugin devices from the list of APQNs
...

Lines 51, 54, 57: List of APQNs from the different CEX config sets that have been found on the compute node and are allocatable.
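
To summarize these messages for a whole log, the set name and the count can be extracted with sed. This is a sketch that assumes the message format of lines 51, 54, and 57. The function name eligible_apqns_per_set is hypothetical.

```shell
# Hypothetical helper: reads a plug-in log on stdin and prints, for each
# "Found N eligible APQNs" message, the config set name followed by N.
eligible_apqns_per_set() {
  sed -n "s/.*Plugin\['\([^']*\)']: Found \([0-9]*\) eligible APQNs.*/\1 \2/p"
}

# Usage against a live cluster (requires kubectl access):
#   kubectl logs -n cex-device-plugin cex-plugin-daemonset-p5j8h | eligible_apqns_per_set
```

Comparing the counts with the number of APQNs configured for each set shows how many of the configured APQNs are actually present on the compute node.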

The following example shows the log entries for an actual allocation request from a container:

...
70: 2022/06/07 14:17:03 Plugin['CCA_for_customer_1']: Allocate(request=&AllocateRequest{ContainerRequests:
		    []*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[apqn-9-51-0],},},})
71: 2022/06/07 14:17:03 Plugin['CCA_for_customer_1']: creating zcrypt device node 'zcrypt-apqn-9-51-0'
72: 2022/06/07 14:17:03 Zcrypt: Successfully created new zcrypt device node 'zcrypt-apqn-9-51-0'
73: 2022/06/07 14:17:03 Zcrypt: simple node 'zcrypt-apqn-9-51-0' for APQN(9,51) created
74: 2022/06/07 14:17:03 Shadowsysfs: shadow dir /var/tmp/shadowsysfs/sysfs-apqn-9-51-0 created
75: 2022/06/07 14:17:03 Plugin['CCA_for_customer_1']: Allocate() response=&AllocateResponse{ContainerResponses:
		    []*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[]*Mount{&Mount{ContainerPath:/sys/bus/ap,HostPath:/var/tmp/shadowsysfs/sysfs-apqn-9-51-0/bus/ap,ReadOnly:true,},&Mount{ContainerPath:/sys/devices/ap,HostPath:/var/tmp/shadowsysfs/sysfs-apqn-9-51-0/devices/ap,ReadOnly:true,},},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/z90crypt,HostPath:/dev/zcrypt-apqn-9-51-0,Permissions:rw,},},Annotations:map[string]string{},},},}
...

About every 30 seconds, the plug-in logs the running containers that have allocated CEX resources:

...
80: 2022/06/07 14:47:18 PodLister: 1 active zcrypt nodes
81: 2022/06/07 14:47:18 PodLister: 1 active sysfs shadow dirs
82: 2022/06/07 14:47:18 PodLister: Container 'cex-testload-1' in namespace 'default' uses CEX resource 'apqn-9-51-0' marked for project 'customer_1'!!!
83: 2022/06/07 14:47:18 PodLister: 1 active containers with allocated cex devices
...
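
Because this report repeats, a long log contains many copies of the same entries. The following sketch reduces them to a unique list of containers and the CEX resources they use. It assumes the message format of line 82, and the function name cex_resource_users is hypothetical.

```shell
# Hypothetical helper: reads a plug-in log on stdin and prints each distinct
# "namespace/container resource" pair that a PodLister message reported.
cex_resource_users() {
  sed -n "s|.*PodLister: Container '\([^']*\)' in namespace '\([^']*\)' uses CEX resource '\([^']*\)'.*|\2/\1 \3|p" | sort -u
}

# Usage against a live cluster (requires kubectl access):
#   kubectl logs -n cex-device-plugin cex-plugin-daemonset-p5j8h | cex_resource_users
```

The list covers the whole log, so it can include containers that have terminated in the meantime.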

When a container with an allocated CEX resource terminates, a cleanup step runs, which is reported in the log as follows:

...
90: 2022/06/07 14:52:18 PodLister: 1 active zcrypt nodes
91: 2022/06/07 14:52:18 PodLister: 1 active sysfs shadow dirs
92: 2022/06/07 14:52:18 PodLister: 0 active containers with allocated cex devices
93: 2022/06/07 14:52:18 PodLister: deleting zcrypt node 'zcrypt-apqn-9-51-0': no container use since 120 s
94: 2022/06/07 14:52:18 PodLister: deleting shadow sysfs 'sysfs-apqn-9-51-0': no container use since 120 s
...

Capturing debug data for support

If you submit a support case, provide debugging data. Describe the failure and the expected behavior, and collect the logs of all CEX device plug-in instances together with the currently active CEX resource configuration map. Optionally, you can include the output of kubectl describe nodes. Be careful when providing this node data, because it might expose details about the workload on the cluster.

For example, run the following commands to collect the required information:

$ cd /tmp
$ for p in $(kubectl get pods -n cex-device-plugin --no-headers | grep cex-plugin-daemonset | awk '{print $1}'); do \
kubectl logs -n cex-device-plugin "$p" >"$p.log"; done
$ kubectl get configmap -n cex-device-plugin cex-resources-config -o yaml >cex-resources-config.yaml
$ kubectl describe nodes >describe_nodes.log
$ zip debugdata.zip cex-plugin-daemonset-*.log cex-resources-config.yaml describe_nodes.log
$ rm cex-plugin-daemonset-*.log cex-resources-config.yaml describe_nodes.log

Note: The CEX device plug-in does not have access to any cluster or application secrets. Therefore, only administrative information related to the APQNs that are managed by the plug-in is logged. The logs contain the names of the configuration sets and the names and namespaces of pods that request and use APQNs. Because the logs contain no application, cluster, or company secrets, it is safe to hand over this logging information to technical support.