Ceph Pacific: Detecting Configuration Anomalies

Paul Cuzner

The Ceph Pacific release introduces a new feature within the cephadm mgr module that makes use of the metadata each Ceph host provides. The cephadm binary (your friend from the bootstrap process!) implements a gather-facts sub-command which examines the host and assembles relevant metadata. This metadata lets the orchestrator/cephadm logic better understand the hosts supporting Ceph services and daemons, and provides the basis for detecting configuration anomalies across your Ceph cluster.

You can take a look at the metadata that gather-facts generates by running the cephadm binary directly, like this:

[cephadmin@c8-node1 ~]# sudo cephadm gather-facts
{
  "arch": "x86_64",
  "bios_date": "04/01/2014",
  "bios_version": "1.14.0-3.fc34",
  "cpu_cores": 1,
  "cpu_count": 4,
  "cpu_load": {
    "15min": 0.79,
    "1min": 0.29,
    "5min": 0.35
  },
  "cpu_model": "Intel Xeon Processor (Cooperlake)",
  "cpu_threads": 1,
  "flash_capacity": "0.0",
  "flash_capacity_bytes": 0,
  "flash_count": 0,
  "flash_list": [],
  "hdd_capacity": "3.3TB",
  "hdd_capacity_bytes": 3341484556288,
  "hdd_count": 4,

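If you'd like to poke at this metadata from a script, here is a minimal Python sketch. It assumes gather-facts prints a single JSON document on stdout and only uses keys that appear in the output above; the helper names (load_facts, summarize, gather_facts) are made up for illustration and are not part of cephadm.

```python
import json
import subprocess


def load_facts(raw_json: str) -> dict:
    """Parse the JSON document printed by `cephadm gather-facts`."""
    return json.loads(raw_json)


def summarize(facts: dict) -> str:
    """Build a one-line summary from keys shown in the sample output."""
    hdd_tb = facts["hdd_capacity_bytes"] / 1e12  # decimal terabytes
    return (
        f"{facts['arch']}, {facts['cpu_count']} CPUs, "
        f"{facts['hdd_count']} HDDs ({hdd_tb:.1f} TB)"
    )


def gather_facts() -> dict:
    """Run cephadm on the local host (assumes root and cephadm on PATH)."""
    result = subprocess.run(
        ["cephadm", "gather-facts"],
        check=True, capture_output=True, text=True,
    )
    return load_facts(result.stdout)


# Values copied from the sample output, so this runs without a Ceph host.
sample = load_facts(
    '{"arch": "x86_64", "cpu_count": 4, '
    '"hdd_count": 4, "hdd_capacity_bytes": 3341484556288}'
)
print(summarize(sample))  # x86_64, 4 CPUs, 4 HDDs (3.3 TB)
```

On a real host you would call summarize(gather_facts()) instead of using the sample values.
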
The cephadm mgr module invokes the gather-facts command regularly to maintain a 'picture' of each host’s configuration. Having this data available helps with scheduling decisions, and also presents an opportunity to get ahead of any configuration anomalies, or drift, that could turn into performance- or service-impacting events. This is the focus of the config-check feature.

Whether the configuration checks are active is determined by the config_checks_enabled setting of the cephadm mgr module. By default it’s disabled. You can check the state of this setting by running the following command from the cephadm shell:

[ceph: root@c8-node1 /]# ceph config get mgr mgr/cephadm/config_checks_enabled
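
The same lookup can be wrapped in a script. This is only a sketch: it assumes the ceph CLI is on the PATH and that the command prints the value as a bare true or false, which is worth confirming on your own cluster. The function names are hypothetical.

```python
import subprocess

SETTING = "mgr/cephadm/config_checks_enabled"


def parse_bool(text: str) -> bool:
    """Convert the CLI output ('true' or 'false') into a Python bool."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"unexpected value for {SETTING}: {text!r}")


def config_checks_enabled() -> bool:
    """Ask the cluster whether the config-check feature is switched on."""
    result = subprocess.run(
        ["ceph", "config", "get", "mgr", SETTING],
        check=True, capture_output=True, text=True,
    )
    return parse_bool(result.stdout)
```

Raising on anything other than true or false is deliberate: a script that silently treats unexpected output as "disabled" could hide a real problem.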

Even with the feature disabled, you can see the new commands that this feature introduces with the ceph cephadm -h command.

ceph cephadm config-check status                Show whether the configuration checker feature is enabled/disabled
ceph cephadm config-check ls                    List the available configuration checks and their current state
ceph cephadm config-check disable <check_name>  Disable a specific configuration check
ceph cephadm config-check enable <check_name>   Enable a specific configuration check

Some of these checks may make sense within your environment, others may not. The point is, the choice is yours. You can enable/disable the feature with a single command, and pick the checks that make sense to you.

Here’s an example where the software subscription and kernel version consistency checks are disabled.

[ceph: root@c8-node1 /]# ceph cephadm config-check disable os_subscription
[ceph: root@c8-node1 /]# ceph cephadm config-check disable kernel_version
[ceph: root@c8-node1 /]# ceph cephadm config-check ls
NAME             HEALTHCHECK                      STATUS    DESCRIPTION
kernel_security  CEPHADM_CHECK_KERNEL_LSM         enabled   checks SELINUX/Apparmor profiles are consistent across cluster hosts
os_subscription  CEPHADM_CHECK_SUBSCRIPTION       disabled  checks subscription states are consistent for all cluster hosts
public_network   CEPHADM_CHECK_PUBLIC_MEMBERSHIP  enabled   check that all hosts have a NIC on the Ceph public_netork
osd_mtu_size     CEPHADM_CHECK_MTU                enabled   check that OSD hosts share a common MTU setting
osd_linkspeed    CEPHADM_CHECK_LINKSPEED          enabled   check that OSD hosts share a common linkspeed
network_missing  CEPHADM_CHECK_NETWORK_MISSING    enabled   checks that the cluster/public networks defined exist on the Ceph hosts
ceph_release     CEPHADM_CHECK_CEPH_RELEASE       enabled   check for Ceph version consistency - ceph daemons should be on the same
                                                            release (unless upgrade is active)
kernel_version   CEPHADM_CHECK_KERNEL_VERSION     disabled  checks that the MAJ.MIN of the kernel on Ceph hosts is consistent
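
To give a feel for what a consistency check like kernel_version involves, here is a rough Python sketch that compares the MAJ.MIN of each host's kernel against the most common value. It is an illustration of the idea, not the cephadm implementation: the input is a plain dict of host name to kernel release string, and the host names other than c8-node1 and all kernel strings are invented.

```python
from collections import Counter
from typing import Dict


def maj_min(release: str) -> str:
    """Reduce a kernel release string to its MAJ.MIN prefix."""
    return ".".join(release.split(".")[:2])


def kernel_outliers(kernel_by_host: Dict[str, str]) -> Dict[str, str]:
    """Return hosts whose kernel MAJ.MIN differs from the majority."""
    versions = {host: maj_min(rel) for host, rel in kernel_by_host.items()}
    if not versions:
        return {}
    # The most common MAJ.MIN is treated as the expected value.
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {host: ver for host, ver in versions.items() if ver != majority}


# Hypothetical cluster: two hosts agree, one has drifted.
cluster = {
    "c8-node1": "4.18.0-305.el8.x86_64",
    "c8-node2": "4.18.0-305.el8.x86_64",
    "c8-node3": "5.14.0-70.el9.x86_64",
}
print(kernel_outliers(cluster))  # {'c8-node3': '5.14'}
```

Treating the majority as "correct" is a simplification; with an even split there is no obvious right answer, which is one reason a real check reports an anomaly rather than trying to fix it.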

Enabling the config-check feature is simple: just set the config_checks_enabled flag to true, like this:

[ceph: root@c8-node1 /]# ceph config set mgr mgr/cephadm/config_checks_enabled true
[ceph: root@c8-node1 /]# ceph cephadm config-check status

Now mgr/cephadm can analyse the host metadata automatically, according to the checks you have enabled. Any issues will generate a "WARN level" health alert that you’ll see with ceph -s. However, if every enabled check passes, all you'll see is a summary message in the ceph mgr log, like this:

CEPHADM 6/8 checks enabled and executed (0 bypassed, 2 disabled). No issues detected
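
If you collect mgr logs centrally, a summary line like the one above is easy to pick apart. The Python sketch below is based solely on that single sample line; the exact wording may differ between releases or when issues are found, so treat the pattern as an assumption. CheckSummary and parse_summary are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the one sample log line shown in this post.
SUMMARY_RE = re.compile(
    r"CEPHADM (?P<executed>\d+)/(?P<total>\d+) checks enabled and executed "
    r"\((?P<bypassed>\d+) bypassed, (?P<disabled>\d+) disabled\)\. "
    r"(?P<outcome>.+)$"
)


class CheckSummary(NamedTuple):
    executed: int
    total: int
    bypassed: int
    disabled: int
    outcome: str


def parse_summary(line: str) -> Optional[CheckSummary]:
    """Extract the counters from a summary line, or None if no match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return CheckSummary(
        executed=int(match["executed"]),
        total=int(match["total"]),
        bypassed=int(match["bypassed"]),
        disabled=int(match["disabled"]),
        outcome=match["outcome"],
    )


line = ("CEPHADM 6/8 checks enabled and executed "
        "(0 bypassed, 2 disabled). No issues detected")
print(parse_summary(line))
```

Using search rather than a full-line match means the pattern still works when the log line carries a timestamp or other prefix.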

If you want to find out more about what the configuration checks do, they’re described in the cephadm documentation.

Not every environment will benefit from this feature, which is why it's disabled by default. However, trying it is simple and you never know - it may find something you weren't expecting! If you try it and decide it's not for you, just set config_checks_enabled back to false, and you're done.

Maybe you have ideas for new configuration checks, or perhaps ideas for metadata that would help identify other configuration anomalies. If so, we welcome contributions in the form of PRs, but feel free to reach out on dev@ceph.io with any questions or ideas. Just prefix the subject of your email with "orchestrator/cephadm".