Metric Messages

Metrics are extracted from several sources:

  • Data received from collectd.
  • Log messages processed by the collector service.
  • OpenStack notifications processed by the collector service.

Metric Messages Format

In addition to the fields of the Common Message Format, metric messages have the following properties.

Attributes in bold are always present in the messages while attributes in italic are optional.

  • Logger (string), the data source from Heka’s standpoint. It can be collectd, notification_processor or http_log_parser.
  • Type (string), one of:
    • metric or heka.sandbox.metric for the single-value metrics.
    • heka.sandbox.multivalue_metric for the multi-valued metrics (eg annotations).
    • heka.sandbox.bulk_metric for the metrics sent in bulk.
    • heka.sandbox.afd_service_metric for the AFD service metrics.
    • heka.sandbox.afd_node_metric for the AFD node metrics.
    • heka.sandbox.gse_service_cluster_metric for the GSE service cluster metrics.
    • heka.sandbox.gse_node_cluster_metric for the GSE node cluster metrics.
    • heka.sandbox.gse_cluster_metric for the GSE global cluster metrics.
  • Severity (number), it is always equal to 6 (INFO).
  • Fields
    • name (string), the name of the metric. See List of metrics for the current metric names that are emitted.
    • value (number), the value associated with the metric.
    • type (string), the metric’s type, either gauge (a value that can go up or down), counter (an always increasing value) or derive (a per-second rate).
    • source (string), where the metric comes from, for example the name of the collectd plugin or <service>-api for HTTP response metrics.
    • hostname (string), the name of the host to which the metric applies. It may be different from the Hostname value. For instance when the metric is extracted from an OpenStack notification, Hostname is the host that captured the notification and Fields[hostname] is the host that emitted the notification.
    • interval (number), the interval at which the metric is emitted (for the collectd metrics).
    • tenant_id (string), the UUID of the OpenStack tenant to which the metric applies.
    • user_id (string), the UUID of the OpenStack user to which the metric applies.

Metric messages may include additional fields to specify the scope of the measurement. When this is the case, these fields are detailed in the list of metrics presented hereafter.
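
As a rough illustration of the layout described above, the sketch below models a metric message as a plain Python dictionary. The dictionary shape, the metric_host helper and every sample value are assumptions made for this example; real messages use Heka's own message format.

```python
# Hypothetical dictionary view of a metric message; all sample values are made up.
example_message = {
    "Logger": "notification_processor",
    "Type": "metric",
    "Severity": 6,
    "Hostname": "node-1",  # host that captured the notification
    "Fields": {
        "name": "openstack_nova_instance_creation_time",
        "value": 12.5,
        "type": "gauge",
        "hostname": "node-7",  # host that emitted the notification
    },
}


def metric_host(message):
    """Return Fields[hostname] when it is set, otherwise fall back to Hostname."""
    fields = message.get("Fields", {})
    return fields.get("hostname") or message.get("Hostname")


print(metric_host(example_message))
```

Preferring Fields[hostname] over Hostname is a choice made for this sketch, based on the note that the two values may differ for metrics extracted from notifications.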

List of metrics

This is the list of metrics that are emitted by the LMA collector service. They are listed by category then by metric name.

System

CPU

Metrics have a cpu_number field that contains the CPU number to which the metric applies.

  • cpu_idle, percentage of CPU time spent in the idle task.
  • cpu_interrupt, percentage of CPU time spent servicing interrupts.
  • cpu_nice, percentage of CPU time spent in user mode with low priority (nice).
  • cpu_softirq, percentage of CPU time spent servicing soft interrupts.
  • cpu_steal, percentage of CPU time spent in other operating systems.
  • cpu_system, percentage of CPU time spent in system mode.
  • cpu_user, percentage of CPU time spent in user mode.
  • cpu_wait, percentage of CPU time spent waiting for I/O operations to complete.

Disk

Metrics have a device field that contains the name of the disk device to which the metric applies (eg ‘sda’, ‘sdb’ and so on).

  • disk_merged_read, the number of read operations per second that could be merged with already queued operations.
  • disk_merged_write, the number of write operations per second that could be merged with already queued operations.
  • disk_octets_read, the number of octets (bytes) read per second.
  • disk_octets_write, the number of octets (bytes) written per second.
  • disk_ops_read, the number of read operations per second.
  • disk_ops_write, the number of write operations per second.
  • disk_time_read, the average time for a read operation to complete in the last interval.
  • disk_time_write, the average time for a write operation to complete in the last interval.

File system

Metrics have a fs field that contains the partition’s mount point to which the metric applies (eg ‘/’, ‘/var/lib’ and so on).

  • fs_inodes_free, the number of free inodes on the file system.
  • fs_inodes_reserved, the number of reserved inodes.
  • fs_inodes_used, the number of used inodes.
  • fs_space_free, the number of free bytes.
  • fs_space_reserved, the number of reserved bytes.
  • fs_space_used, the number of used bytes.
  • fs_inodes_percent_free, the percentage of free inodes on the file system.
  • fs_inodes_percent_reserved, the percentage of reserved inodes.
  • fs_inodes_percent_used, the percentage of used inodes.
  • fs_space_percent_free, the percentage of free bytes.
  • fs_space_percent_reserved, the percentage of reserved bytes.
  • fs_space_percent_used, the percentage of used bytes.

System load

  • load_longterm, the system load average over the last 15 minutes.
  • load_midterm, the system load average over the last 5 minutes.
  • load_shortterm, the system load average over the last minute.

Memory

  • memory_buffered, the amount of memory (in bytes) which is buffered.
  • memory_cached, the amount of memory (in bytes) which is cached.
  • memory_free, the amount of memory (in bytes) which is free.
  • memory_used, the amount of memory (in bytes) which is used.

Network

Metrics have an interface field that contains the interface name to which the metric applies (eg ‘eth0’, ‘eth1’ and so on).

  • if_errors_rx, the number of errors per second detected when receiving from the interface.
  • if_errors_tx, the number of errors per second detected when transmitting from the interface.
  • if_octets_rx, the number of octets (bytes) received per second by the interface.
  • if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
  • if_packets_rx, the number of packets received per second by the interface.
  • if_packets_tx, the number of packets transmitted per second by the interface.

Processes

  • processes_fork_rate, the number of processes forked per second.
  • processes_count, the number of processes in a given state. The metric has a state field (one of ‘blocked’, ‘paging’, ‘running’, ‘sleeping’, ‘stopped’ or ‘zombies’).

Swap

  • swap_cached, the amount of cached memory (in bytes) which is in the swap.
  • swap_free, the amount of free memory (in bytes) which is in the swap.
  • swap_used, the amount of used memory (in bytes) which is in the swap.
  • swap_io_in, the number of swap pages written per second.
  • swap_io_out, the number of swap pages read per second.

Users

  • logged_users, the number of users currently logged-in.

Apache

  • apache_bytes, the number of bytes per second transmitted by the server.
  • apache_requests, the number of requests processed per second.
  • apache_connections, the current number of active connections.
  • apache_idle_workers, the current number of idle workers.
  • apache_workers_closing, the number of workers in closing state.
  • apache_workers_dnslookup, the number of workers in DNS lookup state.
  • apache_workers_finishing, the number of workers in finishing state.
  • apache_workers_idle_cleanup, the number of workers in idle cleanup state.
  • apache_workers_keepalive, the number of workers in keepalive state.
  • apache_workers_logging, the number of workers in logging state.
  • apache_workers_open, the number of workers in open state.
  • apache_workers_reading, the number of workers in reading state.
  • apache_workers_sending, the number of workers in sending state.
  • apache_workers_starting, the number of workers in starting state.
  • apache_workers_waiting, the number of workers in waiting state.

MySQL

Commands

mysql_commands, the number of times per second a given statement has been executed. The metric has a command field that contains the statement to which it applies. The values can be:

  • change_db for the USE statement.
  • commit for the COMMIT statement.
  • flush for the FLUSH statement.
  • insert for the INSERT statement.
  • rollback for the ROLLBACK statement.
  • select for the SELECT statement.
  • set_option for the SET statement.
  • show_collations for the SHOW COLLATION statement.
  • show_databases for the SHOW DATABASES statement.
  • show_fields for the SHOW FIELDS statement.
  • show_master_status for the SHOW MASTER STATUS statement.
  • show_status for the SHOW STATUS statement.
  • show_tables for the SHOW TABLES statement.
  • show_variables for the SHOW VARIABLES statement.
  • show_warnings for the SHOW WARNINGS statement.
  • update for the UPDATE statement.

Handlers

mysql_handler, the number of times per second a given handler has been executed. The metric has a handler field that contains the handler to which it applies. The values can be:

  • commit for the internal COMMIT statements.
  • delete for the internal DELETE statements.
  • external_lock for the external locks.
  • read_first for the requests that read the first entry in an index.
  • read_key for the requests that read a row based on a key.
  • read_next for the requests that read the next row in key order.
  • read_prev for the requests that read the previous row in key order.
  • read_rnd for the requests that read a row based on a fixed position.
  • read_rnd_next for the requests that read the next row in the data file.
  • rollback for the requests that perform a rollback operation.
  • update for the requests that update a row in a table.
  • write for the requests that insert a row in a table.

Locks

  • mysql_locks_immediate, the number of times per second the requests for table locks could be granted immediately.
  • mysql_locks_waited, the number of times per second the requests for table locks had to wait.

Network

  • mysql_octets_rx, the number of bytes received per second by the server.
  • mysql_octets_tx, the number of bytes sent per second by the server.

Threads

  • mysql_threads_cached, the number of threads in the thread cache.
  • mysql_threads_connected, the number of currently open connections.
  • mysql_threads_running, the number of threads that are not sleeping.
  • mysql_threads_created, the number of threads created per second to handle connections.

Cluster

These metrics are collected with the ‘SHOW STATUS’ statement. See the Percona documentation for further details.

  • mysql_cluster_size, current number of nodes in the cluster.
  • mysql_cluster_status, 1 when the node is ‘Primary’, 2 if ‘Non-Primary’ and 3 if ‘Disconnected’.
  • mysql_cluster_connected, 1 when the node is connected to the cluster, 0 otherwise.
  • mysql_cluster_ready, 1 when the node is ready to accept queries, 0 otherwise.
  • mysql_cluster_local_commits, number of writesets committed on the node.
  • mysql_cluster_received_bytes, total size in bytes of writesets received from other nodes.
  • mysql_cluster_received, total number of writesets received from other nodes.
  • mysql_cluster_replicated_bytes, total size in bytes of writesets sent to other nodes.
  • mysql_cluster_replicated, total number of writesets sent to other nodes.
  • mysql_cluster_local_cert_failures, number of writesets that failed the certification test.
  • mysql_cluster_local_send_queue, the number of writesets waiting to be sent.
  • mysql_cluster_local_recv_queue, the number of writesets waiting to be applied.
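
The numeric encodings listed above can be decoded on the consumer side. The following is a minimal sketch; the function name and the shape of the returned dictionary are hypothetical, and only the value-to-state mapping comes from the metric descriptions.

```python
# Mapping taken from the mysql_cluster_status description above.
MYSQL_CLUSTER_STATUS = {1: "Primary", 2: "Non-Primary", 3: "Disconnected"}


def describe_cluster_node(status, connected, ready):
    """Summarize mysql_cluster_status, mysql_cluster_connected and mysql_cluster_ready."""
    return {
        "state": MYSQL_CLUSTER_STATUS.get(int(status), "unrecognized"),
        "connected": int(connected) == 1,
        "ready": int(ready) == 1,
    }


print(describe_cluster_node(1, 1, 1))
```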

Slow Queries

This metric is collected with the statement SHOW STATUS where Variable_name = ‘Slow_queries’.

  • mysql_slow_queries, number of queries that have taken more than X seconds, where X is the value of the MySQL configuration parameter ‘long_query_time’ (10 seconds by default).

RabbitMQ

Cluster

  • rabbitmq_connections, total number of connections.
  • rabbitmq_consumers, total number of consumers.
  • rabbitmq_exchanges, total number of exchanges.
  • rabbitmq_memory, bytes of memory consumed by the Erlang process associated with all queues, including stack, heap and internal structures.
  • rabbitmq_used_memory, bytes of memory used by the whole RabbitMQ process.
  • rabbitmq_remaining_memory, the difference between rabbitmq_vm_memory_limit and rabbitmq_used_memory.
  • rabbitmq_messages, total number of messages which are ready to be consumed or not yet acknowledged.
  • rabbitmq_total_nodes, total number of nodes in the cluster.
  • rabbitmq_running_nodes, total number of running nodes in the cluster.
  • rabbitmq_queues, total number of queues.
  • rabbitmq_unmirrored_queues, total number of queues that are not mirrored.
  • rabbitmq_vm_memory_limit, the maximum amount of memory allocated for RabbitMQ. When rabbitmq_used_memory uses more than this value, all producers are blocked.
  • rabbitmq_disk_free_limit, the minimum amount of free disk for RabbitMQ. When rabbitmq_disk_free drops below this value, all producers are blocked.
  • rabbitmq_disk_free, the disk free space.
  • rabbitmq_remaining_disk, the difference between rabbitmq_disk_free and rabbitmq_disk_free_limit.
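
The two derived metrics above are plain differences, which can be sketched as follows. The function name and the producers_blocked key are hypothetical; the arithmetic and the blocking conditions follow the descriptions in the list.

```python
def rabbitmq_headroom(used_memory, vm_memory_limit, disk_free, disk_free_limit):
    """Compute the derived RabbitMQ metrics from the four raw values."""
    return {
        # rabbitmq_vm_memory_limit minus rabbitmq_used_memory
        "rabbitmq_remaining_memory": vm_memory_limit - used_memory,
        # rabbitmq_disk_free minus rabbitmq_disk_free_limit
        "rabbitmq_remaining_disk": disk_free - disk_free_limit,
        # Producers are blocked once memory exceeds its limit
        # or free disk drops below its limit.
        "producers_blocked": (
            used_memory > vm_memory_limit or disk_free < disk_free_limit
        ),
    }


print(rabbitmq_headroom(400, 1000, 5000, 1000))
```

A negative remaining value therefore indicates that the corresponding limit has already been crossed.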

Queues

All metrics have a queue field which contains the name of the RabbitMQ queue.

  • rabbitmq_queue_consumers, number of consumers for a given queue.
  • rabbitmq_queue_memory, bytes of memory consumed by the Erlang process associated with the queue, including stack, heap and internal structures.
  • rabbitmq_queue_messages, number of messages which are ready to be consumed or not yet acknowledged for the given queue.

HAProxy

frontend and backend field values can be:

  • cinder-api
  • glance-api
  • glance-registry-api
  • heat-api
  • heat-cfn-api
  • heat-cloudwatch-api
  • horizon-web (when Horizon is deployed without TLS)
  • horizon-https (when Horizon is deployed with TLS)
  • keystone-public-api
  • keystone-admin-api
  • mysqld-tcp
  • murano-api
  • neutron-api
  • nova-api
  • nova-ec2-api
  • nova-metadata-api
  • nova-novncproxy-websocket
  • sahara-api
  • swift-api

Server

  • haproxy_connections, the number of current connections.
  • haproxy_ssl_connections, the number of current SSL connections.
  • haproxy_pipes_free, the number of free pipes.
  • haproxy_pipes_used, the number of used pipes.
  • haproxy_run_queue, the number of connections waiting in the queue.
  • haproxy_tasks, the number of tasks.
  • haproxy_uptime, the HAProxy server uptime in seconds.

Frontends

  • haproxy_frontend_bytes_in, the total number of bytes received by all frontends.
  • haproxy_frontend_bytes_out, the total number of bytes transmitted by all frontends.
  • haproxy_frontend_session_current, the total number of current sessions for all frontends.

The following metrics have a frontend field that contains the name of the frontend server.

  • haproxy_frontend_bytes_in, the number of bytes received by the frontend.
  • haproxy_frontend_bytes_out, the number of bytes transmitted by the frontend.
  • haproxy_frontend_denied_requests, the number of denied requests.
  • haproxy_frontend_denied_responses, the number of denied responses.
  • haproxy_frontend_error_requests, the number of error requests.
  • haproxy_frontend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_frontend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_frontend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_frontend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_frontend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_frontend_response_other, the number of HTTP responses with other code.
  • haproxy_frontend_session_current, the number of current sessions.
  • haproxy_frontend_session_total, the cumulative number of sessions.

Backends

  • haproxy_backend.bytes_in, the total number of bytes received by all backends.
  • haproxy_backend.bytes_out, the total number of bytes transmitted by all backends.
  • haproxy_backend.queue_current, the total number of requests in queue for all backends.
  • haproxy_backend.session_current, the total number of current sessions for all backends.
  • haproxy_backend.error_responses, the total number of error responses for all backends.

The following metrics have a backend field that contains the name of the backend server.

  • haproxy_backend_bytes_in, the number of bytes received by the backend.
  • haproxy_backend_bytes_out, the number of bytes transmitted by the backend.
  • haproxy_backend_denied_requests, the number of denied requests.
  • haproxy_backend_denied_responses, the number of denied responses.
  • haproxy_backend_downtime, the total downtime in seconds.
  • haproxy_backend_status, the global backend status where values 0 and 1 represent respectively DOWN (all backends are down) and UP (at least one backend is up).
  • haproxy_backend_error_connection, the number of error connections.
  • haproxy_backend_error_responses, the number of error responses.
  • haproxy_backend_queue_current, the number of requests in queue.
  • haproxy_backend_redistributed, the number of times a request was redispatched to another server.
  • haproxy_backend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_backend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_backend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_backend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_backend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_backend_response_other, the number of HTTP responses with other code.
  • haproxy_backend_retries, the number of times a connection to a server was retried.
  • haproxy_backend_servers, the count of servers grouped by state. This metric has an additional state field that contains the state of the backends (either ‘down’ or ‘up’).
  • haproxy_backend_session_current, the number of current sessions.
  • haproxy_backend_session_total, the cumulative number of sessions.

Memcached

  • memcached_command_flush, cumulative number of flush reqs.
  • memcached_command_get, cumulative number of retrieval reqs.
  • memcached_command_set, cumulative number of storage reqs.
  • memcached_command_touch, cumulative number of touch reqs.
  • memcached_connections_current, number of open connections.
  • memcached_items_current, current number of items stored.
  • memcached_octets_rx, total number of bytes read by this server from network.
  • memcached_octets_tx, total number of bytes sent by this server to network.
  • memcached_ops_decr_hits, number of successful decr reqs.
  • memcached_ops_decr_misses, number of decr reqs against missing keys.
  • memcached_ops_evictions, number of valid items removed from cache to free memory for new items.
  • memcached_ops_hits, number of keys that have been requested and found.
  • memcached_ops_incr_hits, number of successful incr reqs.
  • memcached_ops_incr_misses, number of incr reqs against missing keys.
  • memcached_ops_misses, number of items that have been requested and not found.
  • memcached_df_cache_used, current number of bytes used to store items.
  • memcached_df_cache_free, current number of free bytes to store items.
  • memcached_percent_hitratio, percentage of get command hits (in cache).

See memcached documentation for further details.
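
The text above does not state how memcached_percent_hitratio is computed. One common definition, assumed here, is the share of hits among all lookups (hits plus misses). The function below is a sketch under that assumption, and its name is hypothetical.

```python
def hit_ratio_percent(hits, misses):
    """Percentage of lookups served from the cache (assumed formula)."""
    total = hits + misses
    if total == 0:
        # No lookups yet: report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * hits / total


print(hit_ratio_percent(75, 25))
```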

OpenStack

Service checks

  • openstack_check_api, the service’s API status, 1 if it is responsive, 0 otherwise.

    The metric contains a service field that identifies the OpenStack service being checked.

<service> is one of the following values with their respective resource checks:

  • ‘nova-api’: ‘/’
  • ‘cinder-api’: ‘/’
  • ‘cinder-v2-api’: ‘/’
  • ‘glance-api’: ‘/’
  • ‘heat-api’: ‘/’
  • ‘heat-cfn-api’: ‘/’
  • ‘keystone-public-api’: ‘/’
  • ‘neutron-api’: ‘/’
  • ‘ceilometer-api’: ‘/v2/capabilities’
  • ‘swift-api’: ‘/healthcheck’
  • ‘swift-s3-api’: ‘/healthcheck’

Note

All checks are performed without authentication except for Ceilometer.

Compute

These metrics are emitted per compute node.

  • openstack_nova_instance_creation_time, the time (in seconds) it took to launch a new instance.
  • openstack_nova_instance_state, the count of instances which entered a given state (the value is always 1). The metric contains a state field.

These metrics are retrieved from the Nova API and represent the aggregated values across all compute nodes.

  • openstack_nova_total_free_disk, the total amount of disk space (in GB) available for new instances.
  • openstack_nova_total_used_disk, the total amount of disk space (in GB) used by the instances.
  • openstack_nova_total_free_ram, the total amount of memory (in MB) available for new instances.
  • openstack_nova_total_used_ram, the total amount of memory (in MB) used by the instances.
  • openstack_nova_total_free_vcpus, the total number of virtual CPUs available for new instances.
  • openstack_nova_total_used_vcpus, the total number of virtual CPUs used by the instances.
  • openstack_nova_total_running_instances, the total number of running instances.
  • openstack_nova_total_running_tasks, the total number of tasks currently executed.

These metrics are retrieved from the Nova API.

  • openstack_nova_instances, the total count of instances in a given state. The metric contains a state field which is one of ‘active’, ‘deleted’, ‘error’, ‘paused’, ‘resumed’, ‘rescued’, ‘resized’, ‘shelved_offloaded’ or ‘suspended’.

These metrics are retrieved from the Nova database.

  • openstack_nova_services, the total count of Nova services by state. The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).

Identity

These metrics are retrieved from the Keystone API.

  • openstack_keystone_roles, the total number of roles.
  • openstack_keystone_tenants, the number of tenants by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).
  • openstack_keystone_users, the number of users by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).

Volume

These metrics are emitted per volume node.

  • openstack_cinder_volume_creation_time, the time (in seconds) it took to create a new volume.

Note

When using Ceph as the backend storage for volumes, the hostname value is always set to rbd.

These metrics are retrieved from the Cinder API.

  • openstack_cinder_volumes, the number of volumes by state. The metric contains a state field.
  • openstack_cinder_snapshots, the number of snapshots by state. The metric contains a state field.
  • openstack_cinder_volumes_size, the total size (in bytes) of volumes by state. The metric contains a state field.
  • openstack_cinder_snapshots_size, the total size (in bytes) of snapshots by state. The metric contains a state field.

state is one of ‘available’, ‘creating’, ‘attaching’, ‘in-use’, ‘deleting’, ‘backing-up’, ‘restoring-backup’, ‘error’, ‘error_deleting’, ‘error_restoring’, ‘error_extending’.

These metrics are retrieved from the Cinder database.

  • openstack_cinder_services, the total count of Cinder services by state. The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).

Image

These metrics are retrieved from the Glance API.

  • openstack_glance_images, the number of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots, the number of snapshot images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_images_size, the total size (in bytes) of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots_size, the total size (in bytes) of snapshots by state and visibility. The metric contains state and visibility fields.

state is one of ‘queued’, ‘saving’, ‘active’, ‘killed’, ‘deleted’, ‘pending_delete’. visibility is either ‘public’ or ‘private’.

Network

These metrics are retrieved from the Neutron API.

  • openstack_neutron_networks, the number of virtual networks by state. The metric contains a state field.
  • openstack_neutron_subnets, the number of virtual subnets.
  • openstack_neutron_ports, the number of virtual ports by owner and state. The metric contains owner and state fields.
  • openstack_neutron_routers, the number of virtual routers by state. The metric contains a state field.
  • openstack_neutron_floatingips, the total number of floating IP addresses.

<state> is one of ‘active’, ‘build’, ‘down’ or ‘error’.

<owner> is one of ‘compute’, ‘dhcp’, ‘floatingip’, ‘floatingip_agent_gateway’, ‘router_interface’, ‘router_gateway’, ‘router_ha_interface’, ‘router_interface_distributed’ or ‘router_centralized_snat’.

These metrics are retrieved from the Neutron database.

  • openstack_neutron_agents, the total number of Neutron agents by service and state. The metric contains service (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’) and state (one of ‘up’, ‘down’ or ‘disabled’) fields.

API response times

  • openstack_<service>_http_responses, the time (in seconds) it took to serve the HTTP request. The metric contains http_method (eg ‘GET’, ‘POST’, and so on) and http_status (eg ‘200’, ‘404’, and so on) fields.

<service> is one of ‘cinder’, ‘glance’, ‘heat’, ‘keystone’, ‘neutron’ or ‘nova’.
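
The naming pattern above can be expanded programmatically, for example when building queries for each service. This is a minimal sketch; the helper name is hypothetical and the service list is copied from the sentence above.

```python
# Services listed above for the API response time metrics.
API_SERVICES = ("cinder", "glance", "heat", "keystone", "neutron", "nova")


def http_response_metric(service):
    """Build the openstack_<service>_http_responses metric name."""
    if service not in API_SERVICES:
        raise ValueError("unsupported service: {}".format(service))
    return "openstack_{}_http_responses".format(service)


print([http_response_metric(s) for s in API_SERVICES])
```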

Ceph

All Ceph metrics have a cluster field containing the name of the Ceph cluster (ceph by default).

See cluster monitoring and RADOS monitoring for further details.

Cluster

  • ceph_health, the health status of the entire cluster where values 1, 2 and 3 represent OK, WARNING and ERROR respectively.
  • ceph_monitor_count, number of ceph-mon processes.
  • ceph_quorum_count, number of ceph-mon processes participating in the quorum.

Pools

  • ceph_pool_total_bytes, total number of bytes for all pools.
  • ceph_pool_total_used_bytes, total used size in bytes by all pools.
  • ceph_pool_total_avail_bytes, total available size in bytes for all pools.
  • ceph_pool_total_number, total number of pools.

The following metrics have a pool field that contains the name of the Ceph pool.

  • ceph_pool_bytes_used, amount of data in bytes used by the pool.
  • ceph_pool_max_avail, available size in bytes for the pool.
  • ceph_pool_objects, number of objects in the pool.
  • ceph_pool_read_bytes_sec, number of bytes read per second for the pool.
  • ceph_pool_write_bytes_sec, number of bytes written per second for the pool.
  • ceph_pool_op_per_sec, number of operations per second for the pool.
  • ceph_pool_size, number of data replications for the pool.
  • ceph_pool_pg_num, number of placement groups for the pool.

Placement Groups

  • ceph_pg_total, total number of placement groups.
  • ceph_pg_bytes_avail, available size in bytes.
  • ceph_pg_bytes_total, cluster total size in bytes.
  • ceph_pg_bytes_used, data stored size in bytes.
  • ceph_pg_data_bytes, stored data size in bytes before it is replicated, cloned or snapshotted.
  • ceph_pg_state, number of placement groups in a given state. The metric contains a state field whose value is a combination of two or more of the following states, separated by +: creating, active, clean, down, replay, splitting, scrubbing, degraded, inconsistent, peering, repair, recovering, recovery_wait, backfill, backfill-wait, backfill_toofull, incomplete, stale, remapped.
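
Because the state field of ceph_pg_state combines several states with +, consumers usually need to split it before filtering. The helpers below are a sketch with hypothetical names; treating exactly active plus clean as the healthy combination is an assumption of this example.

```python
def split_pg_state(state):
    """Split a combined state such as 'active+clean' into its parts."""
    return [part for part in state.split("+") if part]


def is_active_clean(state):
    """True when the combination is exactly 'active' and 'clean', in any order."""
    return set(split_pg_state(state)) == {"active", "clean"}


print(split_pg_state("active+remapped+backfill"))
```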

OSD Daemons

  • ceph_osd_up, number of OSD daemons UP.
  • ceph_osd_down, number of OSD daemons DOWN.
  • ceph_osd_in, number of OSD daemons IN.
  • ceph_osd_out, number of OSD daemons OUT.

The following metrics have an osd field that contains the OSD identifier.

  • ceph_osd_used, data stored size in bytes for the given OSD.
  • ceph_osd_total, total size in bytes for the given OSD.
  • ceph_osd_apply_latency, apply latency in ms for the given OSD.
  • ceph_osd_commit_latency, commit latency in ms for the given OSD.

OSD Performance

All the following metrics are retrieved per OSD daemon from the corresponding socket /var/run/ceph/ceph-osd.<ID>.asok by issuing the command perf dump.

All metrics have an osd field that contains the OSD identifier.

Note

These metrics are not collected when a node has both the ceph-osd and controller roles.

See OSD performance counters for further details.

  • ceph_perf_osd_recovery_ops, number of recovery operations in progress.
  • ceph_perf_osd_op_wip, number of replication operations currently being processed (primary).
  • ceph_perf_osd_op, number of client operations.
  • ceph_perf_osd_op_in_bytes, number of bytes received from clients for write operations.
  • ceph_perf_osd_op_out_bytes, number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_latency, average latency in ms for client operations (including queue time).
  • ceph_perf_osd_op_process_latency, average latency in ms for client operations (excluding queue time).
  • ceph_perf_osd_op_r, number of client read operations.
  • ceph_perf_osd_op_r_out_bytes, number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_r_latency, average latency in ms for read operation (including queue time).
  • ceph_perf_osd_op_r_process_latency, average latency in ms for read operation (excluding queue time).
  • ceph_perf_osd_op_w, number of client write operations.
  • ceph_perf_osd_op_w_in_bytes, number of bytes received from clients for write operations.
  • ceph_perf_osd_op_w_rlat, average latency in ms for write operations with readable/applied.
  • ceph_perf_osd_op_w_latency, average latency in ms for write operations (including queue time).
  • ceph_perf_osd_op_w_process_latency, average latency in ms for write operation (excluding queue time).
  • ceph_perf_osd_op_rw, number of client read-modify-write operations.
  • ceph_perf_osd_op_rw_in_bytes, number of bytes per second received from clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_out_bytes, number of bytes per second sent to clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_rlat, average latency in ms for read-modify-write operations with readable/applied.
  • ceph_perf_osd_op_rw_latency, average latency in ms for read-modify-write operations (including queue time).
  • ceph_perf_osd_op_rw_process_latency, average latency in ms for read-modify-write operations (excluding queue time).

Pacemaker

Resource location

  • pacemaker_resource_local_active, 1 when the resource is located on the host reporting the metric, 0 otherwise. The metric contains a resource field which is one of ‘vip__public’, ‘vip__management’, ‘vip__vrouter_pub’ or ‘vip__vrouter’.

Clusters

The cluster metrics are emitted by the GSE plugins (see the Alarms Configuration Guide for details).

  • cluster_service_status, the status of the service cluster. The metric contains a cluster_name field that identifies the service cluster.
  • cluster_node_status, the status of the node cluster. The metric contains a cluster_name field that identifies the node cluster.
  • cluster_status, the status of the global cluster. The metric contains a cluster_name field that identifies the global cluster.

The supported values for these metrics are:

  • 0 for the Okay status.
  • 1 for the Warning status.
  • 2 for the Unknown status.
  • 3 for the Critical status.
  • 4 for the Down status.
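
The status values above can be turned into readable labels when displaying cluster metrics. The following is a minimal sketch; the helper name, the output layout and the sample cluster name are hypothetical, and only the value-to-label mapping comes from the list above.

```python
# Mapping taken from the list of supported values above.
CLUSTER_STATUS = {0: "Okay", 1: "Warning", 2: "Unknown", 3: "Critical", 4: "Down"}


def format_cluster_status(metric_name, cluster_name, value):
    """Render a cluster status metric as '<metric>[<cluster_name>] = <label>'."""
    label = CLUSTER_STATUS.get(int(value), "unrecognized")
    return "{}[{}] = {}".format(metric_name, cluster_name, label)


print(format_cluster_status("cluster_service_status", "nova", 3))
```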

LMA self-monitoring

System

Metrics have a service field that contains the name of the service to which the metric applies. Values can be: hekad, collectd, influxd, grafana-server or elasticsearch.

  • lma_components_count_processes, number of processes currently running.
  • lma_components_count_threads, number of threads currently running.
  • lma_components_cputime_user, percentage of CPU time spent in user mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_cputime_syst, percentage of CPU time spent in system mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_disk_bytes_read, number of bytes read from disk(s) per second.
  • lma_components_disk_bytes_write, number of bytes written to disk(s) per second.
  • lma_components_disk_ops_read, number of read operations from disk(s) per second.
  • lma_components_disk_ops_write, number of write operations to disk(s) per second.
  • lma_components_memory_code, physical memory devoted to executable code (bytes).
  • lma_components_memory_data, physical memory devoted to other than executable code (bytes).
  • lma_components_memory_rss, non-swapped physical memory used (bytes).
  • lma_components_memory_vm, virtual memory size (bytes).
  • lma_components_pagefaults_minflt, minor page faults per second.
  • lma_components_pagefaults_majflt, major page faults per second.
  • lma_components_stacksize, absolute value of the address of the start (i.e., bottom) of the stack minus the current value of the stack pointer.

Heka pipeline

Metrics have two fields: name that contains the name of the decoder or filter as defined by Heka and type that is either decoder or filter.

Metrics for both types:

  • hekad_msg_avg_duration, the average time for processing the message (in nanoseconds).
  • hekad_msg_count, the total number of messages processed by the decoder or filter. This will reset to 0 when the process is restarted.
  • hekad_memory, the total memory used by the Sandbox (in bytes).

Additional metrics for filter type:

  • hekad_timer_event_avg_duration, the average time for executing the timer_event function (in nanoseconds).
  • hekad_timer_event_count, the total number of executions of the timer_event function. This will reset to 0 when the process is restarted.