The Ceph Blog


June 18, 2019

New in Nautilus: RBD Performance Monitoring

Prior to Nautilus, Ceph storage administrators did not have access to any built-in RBD performance monitoring and metrics gathering tools. While a storage administrator could monitor high-level cluster or OSD I/O metrics, this was often too coarse-grained to determine the source of noisy-neighbor workloads running on top of RBD images. The best available workaround, assuming the storage administrator had access to all Ceph client nodes, was to poll the metrics from the client nodes via makeshift external tooling.

Nautilus now incorporates a generic metrics gathering framework within the OSDs and MGRs to provide built-in monitoring, and new RBD performance monitoring tools are built on top of this framework to translate individual RADOS object metrics into aggregated RBD image metrics for IOPS, throughput, and latency. These metrics are all generated and processed within the Ceph cluster itself, so there is no need for access to client nodes to scrape metrics.

Note that as a consequence of gathering the metrics directly from the OSDs, the latency metrics will not be end-to-end but instead only OSD internal.

Prometheus Exporter

The first Ceph tool where RBD image metrics can be extracted is via the built-in MGR Prometheus Exporter module. This module can be enabled by running the following (if not already enabled):

$ ceph mgr module enable prometheus
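Once enabled, the exporter's scrape endpoint can be discovered from the active MGR (the module listens on port 9283 by default):

```shell
# List active MGR service endpoints; the prometheus module's scrape URL
# appears here once the module is enabled.
$ ceph mgr services
```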

Note that the Prometheus exporter does not enable RBD metrics by default. To enable the RBD metrics, you must provide the module with a list of RBD pools to export. For example, to export metrics for the “glance”, “cinder”, and “nova” pools:

$ ceph config set mgr mgr/prometheus/rbd_stats_pools glance,cinder,nova
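The pool list is stored as ordinary MGR configuration, so the setting can be read back to verify it took effect (a sketch, using the pool list from the example above):

```shell
# Read back the configured RBD stats pool list.
$ ceph config get mgr mgr/prometheus/rbd_stats_pools
```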

The Prometheus exporter metrics will include read/write op and byte counters in addition to read/write latency counters. An example of a metric export for image “image0” in the “rbd” pool is provided below:

# TYPE ceph_rbd_write_ops counter
ceph_rbd_write_ops{pool="rbd",namespace="",image="image0"} 684652.0
...
# HELP ceph_rbd_read_ops RBD image reads count
# TYPE ceph_rbd_read_ops counter
ceph_rbd_read_ops{pool="rbd",namespace="",image="image0"} 5175256.0
# HELP ceph_rbd_write_bytes RBD image bytes written
# TYPE ceph_rbd_write_bytes counter
ceph_rbd_write_bytes{pool="rbd",namespace="",image="image0"} 3531403264.0
# HELP ceph_rbd_read_bytes RBD image bytes read
# TYPE ceph_rbd_read_bytes counter
ceph_rbd_read_bytes{pool="rbd",namespace="",image="image0"} 242032795680.0
...
# HELP ceph_rbd_write_latency_sum RBD image writes latency (msec) Total
# TYPE ceph_rbd_write_latency_sum counter
ceph_rbd_write_latency_sum{pool="rbd",namespace="",image="image0"} 6886443555662.0
# HELP ceph_rbd_write_latency_count RBD image writes latency (msec) Count
# TYPE ceph_rbd_write_latency_count counter
ceph_rbd_write_latency_count{pool="rbd",namespace="",image="image0"} 684652.0
# HELP ceph_rbd_read_latency_sum RBD image reads latency (msec) Total
# TYPE ceph_rbd_read_latency_sum counter
ceph_rbd_read_latency_sum{pool="rbd",namespace="",image="image0"} 4371224159814.0
# HELP ceph_rbd_read_latency_count RBD image reads latency (msec) Count
# TYPE ceph_rbd_read_latency_count counter
ceph_rbd_read_latency_count{pool="rbd",namespace="",image="image0"} 5175256.0
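Because each latency metric is exported as a sum/count pair, the average latency per operation is simply the sum divided by the count. Using the write latency counters for "image0" from the example export:

```shell
# Average write latency per op for "image0":
#   ceph_rbd_write_latency_sum / ceph_rbd_write_latency_count
$ awk 'BEGIN { printf "%.0f\n", 6886443555662.0 / 684652.0 }'
10058312
```

In practice a monitoring system such as Prometheus would compute this ratio over scraped samples rather than by hand.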

For additional details on how to configure and use the Prometheus exporter, refer to the module documentation.
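To confirm that the per-image metrics are flowing, the exporter endpoint can also be scraped directly (a sketch; "mgr-host" is a placeholder for your active MGR's address):

```shell
# Fetch only the per-image RBD samples from the exporter.
$ curl -s http://mgr-host:9283/metrics | grep '^ceph_rbd_'
```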

Dashboard Integration

Additionally, the Ceph Dashboard’s “Block” tab now includes a new “Overall Performance” sub-tab which will display an embedded Grafana dashboard of high-level RBD metrics. This provides a quick at-a-glance view of the overall block workloads’ IOPS, throughput, and average latency. It also displays the top 10 images with the highest IOPS and throughput, as well as the images with the highest request latency.

This Grafana dashboard depends on the metrics gathered by Prometheus from the MGR Prometheus exporter discussed above. For additional details on how to configure the Ceph Dashboard’s embedded Grafana dashboards, refer to the dashboard documentation.

Command-line Interface

Finally, for storage administrators that prefer to use the command-line, the rbd CLI tool has been augmented with two new actions: rbd perf image iotop and rbd perf image iostat.

The rbd CLI metrics aggregation is performed by a new MGR module, rbd_support, that must be enabled before first use:

$ ceph mgr module enable rbd_support

The rbd perf image iotop action provides an “iotop”-like view of the images with the highest usage by write ops, read ops, write bytes, read bytes, write latency, and read latency. The sort category can be changed dynamically using the left and right arrow keys.

The rbd perf image iostat action provides an “iostat”-like view of images, sorted by one of the available metrics. The output from this action can be formatted as JSON or XML for ingestion by other tools, and the sort column can be changed via a command-line option.

Note that it may take around 30 seconds for metrics to populate upon first use.
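A couple of illustrative invocations follow (the “rbd” pool name is an example; consult `rbd help perf image iostat` on your release for the exact option names):

```shell
# "iotop"-like live view of the busiest images in the "rbd" pool.
$ rbd perf image iotop --pool rbd

# "iostat"-like view sorted by write ops, emitted as JSON for other tools.
$ rbd perf image iostat --pool rbd --sort-by write_ops --format json
```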

Jason Dillaman