List of metrics

The following is a list of metrics that are emitted by the StackLight Collector. The metrics are listed by category, then by metric name.

System

CPU

Metrics have a cpu_number field that contains the CPU number to which the metric applies.

  • cpu_idle, the percentage of CPU time spent in the idle task.
  • cpu_interrupt, the percentage of CPU time spent servicing interrupts.
  • cpu_nice, the percentage of CPU time spent in user mode with low priority (nice).
  • cpu_softirq, the percentage of CPU time spent servicing soft interrupts.
  • cpu_steal, the percentage of CPU time spent in other operating systems.
  • cpu_system, the percentage of CPU time spent in system mode.
  • cpu_user, the percentage of CPU time spent in user mode.
  • cpu_wait, the percentage of CPU time spent waiting for I/O operations to complete.

Disk

Metrics have a device field that contains the name of the disk device to which the metric applies. For example, ‘sda’, ‘sdb’, and others.

  • disk_merged_read, the number of read operations per second that could be merged with already queued operations.
  • disk_merged_write, the number of write operations per second that could be merged with already queued operations.
  • disk_octets_read, the number of octets (bytes) read per second.
  • disk_octets_write, the number of octets (bytes) written per second.
  • disk_ops_read, the number of read operations per second.
  • disk_ops_write, the number of write operations per second.
  • disk_time_read, the average time for a read operation to complete in the last interval.
  • disk_time_write, the average time for a write operation to complete in the last interval.

File system

Metrics have a fs field that contains the partition’s mount point to which the metric applies. For example, ‘/’, ‘/var/lib’, and others.

  • fs_inodes_free, the number of free inodes on the file system.
  • fs_inodes_percent_free, the percentage of free inodes on the file system.
  • fs_inodes_percent_reserved, the percentage of reserved inodes.
  • fs_inodes_percent_used, the percentage of used inodes.
  • fs_inodes_reserved, the number of reserved inodes.
  • fs_inodes_used, the number of used inodes.
  • fs_space_free, the number of free bytes.
  • fs_space_percent_free, the percentage of free bytes.
  • fs_space_percent_reserved, the percentage of reserved bytes.
  • fs_space_percent_used, the percentage of used bytes.
  • fs_space_reserved, the number of reserved bytes.
  • fs_space_used, the number of used bytes.
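
The three fs_space_percent_* metrics can be read as shares of one total. The following Python sketch is illustrative only. It assumes that the total is the sum of the free, reserved, and used byte counts, which this list does not state explicitly, and the function name is hypothetical.

```python
def space_percentages(free, reserved, used):
    """Return (percent_free, percent_reserved, percent_used).

    Assumption: total capacity = free + reserved + used (all in bytes).
    """
    total = free + reserved + used
    if total == 0:
        # Avoid dividing by zero on an empty or pseudo file system.
        return (0.0, 0.0, 0.0)
    return tuple(100.0 * part / total for part in (free, reserved, used))


# Example: 60 GiB free, 5 GiB reserved, 35 GiB used.
GIB = 1024 ** 3
print(space_percentages(60 * GIB, 5 * GIB, 35 * GIB))
```

The same calculation would apply to the fs_inodes_percent_* metrics, with inode counts in place of bytes.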

System load

  • load_longterm, the system load average over the last 15 minutes.
  • load_midterm, the system load average over the last 5 minutes.
  • load_shortterm, the system load average over the last minute.

Memory

  • memory_buffered, the amount of buffered memory in bytes.
  • memory_cached, the amount of cached memory in bytes.
  • memory_free, the amount of free memory in bytes.
  • memory_used, the amount of used memory in bytes.

Network

Metrics have an interface field that contains the interface name the metric applies to. For example, ‘eth0’, ‘eth1’, and others.

  • if_errors_rx, the number of errors per second detected when receiving from the interface.
  • if_errors_tx, the number of errors per second detected when transmitting from the interface.
  • if_octets_rx, the number of octets (bytes) received per second by the interface.
  • if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
  • if_packets_rx, the number of packets received per second by the interface.
  • if_packets_tx, the number of packets transmitted per second by the interface.

Processes

  • processes_count, the number of processes in a given state. The metric has a state field (one of ‘blocked’, ‘paging’, ‘running’, ‘sleeping’, ‘stopped’ or ‘zombies’).
  • processes_fork_rate, the number of processes forked per second.

Swap

  • swap_cached, the amount of cached memory (in bytes) that is in the swap.
  • swap_free, the amount of free memory (in bytes) that is in the swap.
  • swap_io_in, the number of pages swapped in (read from the swap) per second.
  • swap_io_out, the number of pages swapped out (written to the swap) per second.
  • swap_used, the amount of used memory (in bytes) that is in the swap.

Users

  • logged_users, the number of users currently logged in.

Apache

  • apache_bytes, the number of bytes per second transmitted by the server.
  • apache_connections, the current number of active connections.
  • apache_idle_workers, the current number of idle workers.
  • apache_requests, the number of requests processed per second.
  • apache_workers_closing, the number of workers in closing state.
  • apache_workers_dnslookup, the number of workers in DNS lookup state.
  • apache_workers_finishing, the number of workers in finishing state.
  • apache_workers_idle_cleanup, the number of workers in idle cleanup state.
  • apache_workers_keepalive, the number of workers in keepalive state.
  • apache_workers_logging, the number of workers in logging state.
  • apache_workers_open, the number of workers in open state.
  • apache_workers_reading, the number of workers in reading state.
  • apache_workers_sending, the number of workers in sending state.
  • apache_workers_starting, the number of workers in starting state.
  • apache_workers_waiting, the number of workers in waiting state.

MySQL

Commands

mysql_commands, the number of times per second a given statement has been executed. The metric has a statement field that contains the statement to which it applies. The values can be as follows:

  • change_db for the USE statement.
  • commit for the COMMIT statement.
  • flush for the FLUSH statement.
  • insert for the INSERT statement.
  • rollback for the ROLLBACK statement.
  • select for the SELECT statement.
  • set_option for the SET statement.
  • show_collations for the SHOW COLLATION statement.
  • show_databases for the SHOW DATABASES statement.
  • show_fields for the SHOW FIELDS statement.
  • show_master_status for the SHOW MASTER STATUS statement.
  • show_status for the SHOW STATUS statement.
  • show_tables for the SHOW TABLES statement.
  • show_variables for the SHOW VARIABLES statement.
  • show_warnings for the SHOW WARNINGS statement.
  • update for the UPDATE statement.

Handlers

mysql_handler, the number of times per second a given handler has been executed. The metric has a handler field that contains the handler it applies to. The values can be as follows:

  • commit for the internal COMMIT statements.
  • delete for the internal DELETE statements.
  • external_lock for the external locks.
  • read_first for the requests that read the first entry in an index.
  • read_key for the requests that read a row based on a key.
  • read_next for the requests that read the next row in key order.
  • read_prev for the requests that read the previous row in key order.
  • read_rnd for the requests that read a row based on a fixed position.
  • read_rnd_next for the requests that read the next row in the data file.
  • rollback for the requests that perform the rollback operation.
  • update for the requests that update a row in a table.
  • write for the requests that insert a row in a table.

Locks

  • mysql_locks_immediate, the number of times per second the requests for table locks could be granted immediately.
  • mysql_locks_waited, the number of times per second the requests for table locks had to wait.

Network

  • mysql_octets_rx, the number of bytes per second received by the server.
  • mysql_octets_tx, the number of bytes per second sent by the server.

Threads

  • mysql_threads_cached, the number of threads in the thread cache.
  • mysql_threads_connected, the number of currently open connections.
  • mysql_threads_created, the number of threads created per second to handle connections.
  • mysql_threads_running, the number of threads that are not sleeping.

Cluster

The following metrics are collected with statement ‘SHOW STATUS’. For details, see Percona documentation.

  • mysql_cluster_connected, 1 when the node is connected to the cluster, if not, then 0.
  • mysql_cluster_local_cert_failures, the number of write sets that failed the certification test.
  • mysql_cluster_local_commits, the number of write sets committed on the node.
  • mysql_cluster_local_recv_queue, the number of write sets waiting to be applied.
  • mysql_cluster_local_send_queue, the number of write sets waiting to be sent.
  • mysql_cluster_ready, 1 when the node is ready to accept queries, if not, then 0.
  • mysql_cluster_received, the total number of write sets received from other nodes.
  • mysql_cluster_received_bytes, the total size in bytes of write sets received from other nodes.
  • mysql_cluster_replicated, the total number of write sets sent to other nodes.
  • mysql_cluster_replicated_bytes, the total size in bytes of write sets sent to other nodes.
  • mysql_cluster_size, the current number of nodes in the cluster.
  • mysql_cluster_status, 1 when the node is ‘Primary’, 2 if ‘Non-Primary’, and 3 if ‘Disconnected’.

Slow queries

The following metric is collected with the statement ‘SHOW STATUS where Variable_name = 'Slow_queries'’.

  • mysql_slow_queries, the number of queries that have taken longer than the number of seconds set by the MySQL configuration parameter ‘long_query_time’ (10 seconds by default).

RabbitMQ

Cluster

  • rabbitmq_connections, the total number of connections.
  • rabbitmq_consumers, the total number of consumers.
  • rabbitmq_channels, the total number of channels.
  • rabbitmq_exchanges, the total number of exchanges.
  • rabbitmq_messages, the total number of messages which are ready to be consumed or not yet acknowledged.
  • rabbitmq_queues, the total number of queues.
  • rabbitmq_running_nodes, the total number of running nodes in the cluster.
  • rabbitmq_disk_free, the free disk space.
  • rabbitmq_disk_free_limit, the minimum amount of free disk space for RabbitMQ. When rabbitmq_disk_free drops below this value, all producers are blocked.
  • rabbitmq_remaining_disk, the difference between rabbitmq_disk_free and rabbitmq_disk_free_limit.
  • rabbitmq_used_memory, bytes of memory used by the whole RabbitMQ process.
  • rabbitmq_vm_memory_limit, the maximum amount of memory allocated for RabbitMQ. When rabbitmq_used_memory uses more than this value, all producers are blocked.
  • rabbitmq_remaining_memory, the difference between rabbitmq_vm_memory_limit and rabbitmq_used_memory.
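
The two "remaining" metrics are simple differences of the metrics above. The following Python sketch shows the arithmetic as described in this list. The function name and the producers_blocked key are hypothetical and are not emitted by the collector.

```python
def rabbitmq_headroom(disk_free, disk_free_limit, used_memory, vm_memory_limit):
    """Compute the derived RabbitMQ metrics from the four base values (bytes)."""
    remaining_disk = disk_free - disk_free_limit
    remaining_memory = vm_memory_limit - used_memory
    return {
        "rabbitmq_remaining_disk": remaining_disk,
        "rabbitmq_remaining_memory": remaining_memory,
        # Hypothetical convenience flag: per the descriptions above, producers
        # are blocked when free disk drops below its limit or when used memory
        # exceeds its limit, that is, when either difference is negative.
        "producers_blocked": remaining_disk < 0 or remaining_memory < 0,
    }


print(rabbitmq_headroom(
    disk_free=5_000_000_000,
    disk_free_limit=50_000_000,
    used_memory=400_000_000,
    vm_memory_limit=1_600_000_000,
))
```

A value close to zero for either difference indicates that the node is about to start blocking producers.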

HAProxy

The frontend and backend field values can be as follows:

  • cinder-api
  • glance-api
  • glance-registry-api
  • heat-api
  • heat-cfn-api
  • heat-cloudwatch-api
  • horizon-web (when Horizon is deployed without TLS)
  • horizon-https (when Horizon is deployed with TLS)
  • keystone-public-api
  • keystone-admin-api
  • mysqld-tcp
  • murano-api
  • neutron-api
  • nova-api
  • nova-metadata-api
  • nova-novncproxy-websocket
  • sahara-api
  • swift-api

Server

  • haproxy_connections, the number of current connections.
  • haproxy_pipes_free, the number of free pipes.
  • haproxy_pipes_used, the number of used pipes.
  • haproxy_run_queue, the number of connections waiting in the queue.
  • haproxy_ssl_connections, the number of current SSL connections.
  • haproxy_tasks, the number of tasks.
  • haproxy_uptime, the HAProxy server uptime in seconds.

Frontends

The following metrics have a frontend field that contains the name of the front-end server:

  • haproxy_frontend_bytes_in, the number of bytes received by the frontend.
  • haproxy_frontend_bytes_out, the number of bytes transmitted by the frontend.
  • haproxy_frontend_denied_requests, the number of denied requests.
  • haproxy_frontend_denied_responses, the number of denied responses.
  • haproxy_frontend_error_requests, the number of error requests.
  • haproxy_frontend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_frontend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_frontend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_frontend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_frontend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_frontend_response_other, the number of HTTP responses with other code.
  • haproxy_frontend_session_current, the number of current sessions.
  • haproxy_frontend_session_total, the cumulative number of sessions.

Backends

The following metrics have a backend field that contains the name of the back-end server:

  • haproxy_backend_bytes_in, the number of bytes received by the back end.
  • haproxy_backend_bytes_out, the number of bytes transmitted by the back end.
  • haproxy_backend_denied_requests, the number of denied requests.
  • haproxy_backend_denied_responses, the number of denied responses.
  • haproxy_backend_downtime, the total downtime in seconds.
  • haproxy_backend_error_connection, the number of error connections.
  • haproxy_backend_error_responses, the number of error responses.
  • haproxy_backend_queue_current, the number of requests in queue.
  • haproxy_backend_redistributed, the number of times a request was redispatched to another server.
  • haproxy_backend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_backend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_backend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_backend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_backend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_backend_response_other, the number of HTTP responses with other code.
  • haproxy_backend_retries, the number of times a connection to a server was retried.
  • haproxy_backend_servers, the count of servers grouped by state. This metric has an additional state field that contains the state of the back ends (either ‘down’ or ‘up’).
  • haproxy_backend_session_current, the number of current sessions.
  • haproxy_backend_session_total, the cumulative number of sessions.
  • haproxy_backend_status, the global back-end status where values 0 and 1 represent, respectively, DOWN (all back ends are down) and UP (at least one back end is up).
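
The relationship between haproxy_backend_servers and haproxy_backend_status can be sketched as follows. This is an illustration of the rule stated above (UP when at least one back end is up), not the collector's actual code. The function name and the dictionary layout are assumptions.

```python
def backend_status(servers_by_state):
    """Return 1 (UP) if at least one server is up, otherwise 0 (DOWN).

    servers_by_state is a hypothetical dict that mirrors the
    haproxy_backend_servers metric, for example {"up": 2, "down": 1}.
    """
    return 1 if servers_by_state.get("up", 0) > 0 else 0


print(backend_status({"up": 2, "down": 1}))  # at least one server is up
print(backend_status({"up": 0, "down": 3}))  # all servers are down
```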

Memcached

  • memcached_command_flush, the cumulative number of flush requests.
  • memcached_command_get, the cumulative number of retrieval requests.
  • memcached_command_set, the cumulative number of storage requests.
  • memcached_command_touch, the cumulative number of touch requests.
  • memcached_connections_current, the number of open connections.
  • memcached_df_cache_free, the current number of free bytes to store items.
  • memcached_df_cache_used, the current number of bytes used to store items.
  • memcached_items_current, the current number of items stored.
  • memcached_octets_rx, the total number of bytes read by this server from the network.
  • memcached_octets_tx, the total number of bytes sent by this server to the network.
  • memcached_ops_decr_hits, the number of successful decr requests.
  • memcached_ops_decr_misses, the number of decr requests against missing keys.
  • memcached_ops_evictions, the number of valid items removed from cache to free memory for new items.
  • memcached_ops_hits, the number of keys that have been requested and found.
  • memcached_ops_incr_hits, the number of successful incr requests.
  • memcached_ops_incr_misses, the number of incr requests against missing keys.
  • memcached_ops_misses, the number of items that have been requested and not found.
  • memcached_percent_hitratio, the percentage of get command hits (in cache).
  • memcached_ps_cputime_syst, the percentage of CPU time spent in system mode by memcached. It can be greater than 100% when the node has more than one CPU.
  • memcached_ps_cputime_user, the percentage of CPU time spent in user mode by memcached. It can be greater than 100% when the node has more than one CPU.

For details, see the Memcached documentation.
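
As an illustration, a hit ratio such as memcached_percent_hitratio can be derived from the hit and miss counters. The following Python sketch assumes the ratio is hits divided by the sum of hits and misses. The exact formula used by the collector is not given in this list, and the function name is hypothetical.

```python
def hit_ratio_percent(hits, misses):
    """Return the share of lookups served from the cache, in percent.

    Assumption: ratio = hits / (hits + misses); returns 0.0 when no lookups
    have happened yet, to avoid a division by zero.
    """
    lookups = hits + misses
    if lookups == 0:
        return 0.0
    return 100.0 * hits / lookups


print(hit_ratio_percent(hits=900, misses=100))
```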

Libvirt

Every metric contains an instance_id field, which is the UUID of the instance for the Nova service.

CPU

  • virt_cpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual instance in a second.
  • virt_vcpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual CPU in a second. The metric contains a vcpu_number field which is the virtual CPU number.

Disk

Metrics have a device field that contains the virtual disk device to which the metric applies. For example, ‘vda’, ‘vdb’, and others.

  • virt_disk_octets_read, the number of octets (bytes) read per second.
  • virt_disk_octets_write, the number of octets (bytes) written per second.
  • virt_disk_ops_read, the number of read operations per second.
  • virt_disk_ops_write, the number of write operations per second.

Memory

  • virt_memory_total, the total amount of memory (in bytes) allocated to the virtual instance.

Network

Metrics have an interface field that contains the interface name to which the metric applies. For example, ‘tap0dc043a6-dd’, ‘tap769b123a-2e’, and others.

  • virt_if_dropped_rx, the number of dropped packets per second when receiving from the interface.
  • virt_if_dropped_tx, the number of dropped packets per second when transmitting from the interface.
  • virt_if_errors_rx, the number of errors per second detected when receiving from the interface.
  • virt_if_errors_tx, the number of errors per second detected when transmitting from the interface.
  • virt_if_octets_rx, the number of octets (bytes) received per second by the interface.
  • virt_if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
  • virt_if_packets_rx, the number of packets received per second by the interface.
  • virt_if_packets_tx, the number of packets transmitted per second by the interface.

OpenStack

Service checks

  • openstack_check_api, the service’s API status, 1 if it is responsive, if not, then 0. The metric contains a service field that identifies the OpenStack service being checked.

<service> is one of the following values with their respective resource checks:

  • ‘ceilometer-api’: ‘/v2/capabilities’
  • ‘cinder-api’: ‘/’
  • ‘cinder-v2-api’: ‘/’
  • ‘glance-api’: ‘/’
  • ‘heat-api’: ‘/’
  • ‘heat-cfn-api’: ‘/’
  • ‘keystone-public-api’: ‘/’
  • ‘neutron-api’: ‘/’
  • ‘nova-api’: ‘/’
  • ‘swift-api’: ‘/healthcheck’
  • ‘swift-s3-api’: ‘/healthcheck’

Note

All checks except for Ceilometer are performed without authentication.

Compute

The following metrics are emitted per compute node:

  • openstack_nova_free_disk, the disk space in GB available for new instances.
  • openstack_nova_free_ram, the memory in MB available for new instances.
  • openstack_nova_free_vcpus, the number of virtual CPUs available for new instances.
  • openstack_nova_instance_creation_time, the time in seconds it took to launch a new instance.
  • openstack_nova_instance_state, the number of instances which entered a given state (the value is always 1). The metric contains a state field.
  • openstack_nova_running_instances, the number of running instances.
  • openstack_nova_running_tasks, the number of tasks currently executed.
  • openstack_nova_used_disk, the disk space in GB used by the instances.
  • openstack_nova_used_ram, the memory in MB used by the instances.
  • openstack_nova_used_vcpus, the number of virtual CPUs used by the instances.

The following metrics are retrieved from the Nova API and represent the aggregated values across all compute nodes.

  • openstack_nova_total_free_disk, the total amount of disk space in GB available for new instances.
  • openstack_nova_total_free_ram, the total amount of memory in MB available for new instances.
  • openstack_nova_total_free_vcpus, the total number of virtual CPUs available for new instances.
  • openstack_nova_total_running_instances, the total number of running instances.
  • openstack_nova_total_running_tasks, the total number of tasks currently executed.
  • openstack_nova_total_used_disk, the total amount of disk space in GB used by the instances.
  • openstack_nova_total_used_ram, the total amount of memory in MB used by the instances.
  • openstack_nova_total_used_vcpus, the total number of virtual CPUs used by the instances.

The following metrics are retrieved from the Nova API:

  • openstack_nova_instances, the total count of instances in a given state. The metric contains a state field which is one of ‘active’, ‘deleted’, ‘error’, ‘paused’, ‘resumed’, ‘rescued’, ‘resized’, ‘shelved_offloaded’ or ‘suspended’.

The following metrics are retrieved from the Nova database:

  • openstack_nova_service, the Nova service state (either 0 for ‘up’, 1 for ‘down’ or 2 for ‘disabled’). The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
  • openstack_nova_services, the total count of Nova services by state. The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’, or ‘disabled’).

Identity

The following metrics are retrieved from the Keystone API:

  • openstack_keystone_roles, the total number of roles.
  • openstack_keystone_tenants, the number of tenants by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).
  • openstack_keystone_users, the number of users by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).

Volume

The following metrics are emitted per volume node:

  • openstack_cinder_volume_creation_time, the time in seconds it took to create a new volume.

Note

When using Ceph as the back end storage for volumes, the hostname value is always set to rbd.

The following metrics are retrieved from the Cinder API:

  • openstack_cinder_snapshots, the number of snapshots by state. The metric contains a state field.
  • openstack_cinder_snapshots_size, the total size (in bytes) of snapshots by state. The metric contains a state field.
  • openstack_cinder_volumes, the number of volumes by state. The metric contains a state field.
  • openstack_cinder_volumes_size, the total size (in bytes) of volumes by state. The metric contains a state field.

state is one of ‘available’, ‘creating’, ‘attaching’, ‘in-use’, ‘deleting’, ‘backing-up’, ‘restoring-backup’, ‘error’, ‘error_deleting’, ‘error_restoring’, ‘error_extending’.

The following metrics are retrieved from the Cinder database:

  • openstack_cinder_service, the Cinder service state (either 0 for ‘up’, 1 for ‘down’, or 2 for ‘disabled’). The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
  • openstack_cinder_services, the total count of Cinder services by state. The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).

Image

The following metrics are retrieved from the Glance API:

  • openstack_glance_images, the number of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_images_size, the total size (in bytes) of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots, the number of snapshot images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots_size, the total size (in bytes) of snapshots by state and visibility. The metric contains state and visibility fields.

state is one of ‘queued’, ‘saving’, ‘active’, ‘killed’, ‘deleted’, ‘pending_delete’. visibility is either ‘public’ or ‘private’.

Network

The following metrics are retrieved from the Neutron API:

  • openstack_neutron_floatingips, the total number of floating IP addresses.
  • openstack_neutron_networks, the number of virtual networks by state. The metric contains a state field.
  • openstack_neutron_ports, the number of virtual ports by owner and state. The metric contains owner and state fields.
  • openstack_neutron_routers, the number of virtual routers by state. The metric contains a state field.
  • openstack_neutron_subnets, the number of virtual subnets.

<state> is one of ‘active’, ‘build’, ‘down’ or ‘error’.

<owner> is one of ‘compute’, ‘dhcp’, ‘floatingip’, ‘floatingip_agent_gateway’, ‘router_interface’, ‘router_gateway’, ‘router_ha_interface’, ‘router_interface_distributed’, or ‘router_centralized_snat’.

The following metrics are retrieved from the Neutron database:

Note

These metrics are not collected when the Contrail plugin is deployed.

  • openstack_neutron_agent, the Neutron agent state (either 0 for ‘up’, 1 for ‘down’, or 2 for ‘disabled’). The metric contains a service field (one of ‘dhcp’, ‘l3’, ‘metadata’, or ‘openvswitch’), and a state field (one of ‘up’, ‘down’ or ‘disabled’).
  • openstack_neutron_agents, the total number of Neutron agents by service and state. The metric contains service (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’) and state (one of ‘up’, ‘down’ or ‘disabled’) fields.

API response times

  • openstack_<service>_http_response_times, HTTP response time statistics. The statistics are min, max, sum, count, and upper_90 (90th percentile) over 10 seconds. The metric contains an http_method field, for example, ‘GET’, ‘POST’, and others, and an http_status field, for example, ‘2xx’, ‘4xx’, and others.

<service> is one of ‘cinder’, ‘glance’, ‘heat’, ‘keystone’, ‘neutron’ or ‘nova’.
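
To make the five statistics concrete, the following Python sketch summarizes the response times observed in one window. It is an illustration under stated assumptions: the percentile method (nearest rank) and the function name are hypothetical, and the collector may compute upper_90 differently.

```python
def summarize(durations):
    """Summarize the response times (in seconds) seen during one window.

    Assumption: upper_90 is the nearest-rank 90th percentile.
    Returns None when the window contains no samples.
    """
    if not durations:
        return None
    ordered = sorted(durations)
    count = len(ordered)
    # Nearest rank, 1-based: ceil(0.9 * count), computed with integers.
    rank = (9 * count + 9) // 10
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "sum": sum(ordered),
        "count": count,
        "upper_90": ordered[rank - 1],
    }


# Ten hypothetical GET requests that returned a 2xx status.
print(summarize([0.12, 0.08, 0.10, 0.95, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11]))
```

Unlike max, upper_90 ignores the slowest tenth of the requests, so a single outlier does not dominate it.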

Logs

  • log_messages, the number of log messages per second for the given service and severity level. The metric contains service and level (one of ‘debug’, ‘info’, and others) fields.

Ceph

All Ceph metrics have a cluster field containing the name of the Ceph cluster (ceph by default).

For details, see Cluster monitoring and RADOS monitoring.

Cluster

  • ceph_health, the health status of the entire cluster where values 1, 2, 3 represent OK, WARNING and ERROR, respectively.
  • ceph_monitor_count, the number of ceph-mon processes.
  • ceph_quorum_count, the number of ceph-mon processes participating in the quorum.

Pools

  • ceph_pool_total_avail_bytes, the total available size in bytes for all pools.
  • ceph_pool_total_bytes, the total number of bytes for all pools.
  • ceph_pool_total_number, the total number of pools.
  • ceph_pool_total_used_bytes, the total used size in bytes by all pools.

The following metrics have a pool field that contains the name of the Ceph pool.

  • ceph_pool_bytes_used, the amount of data in bytes used by the pool.
  • ceph_pool_max_avail, the available size in bytes for the pool.
  • ceph_pool_objects, the number of objects in the pool.
  • ceph_pool_op_per_sec, the number of operations per second for the pool.
  • ceph_pool_pg_num, the number of placement groups for the pool.
  • ceph_pool_read_bytes_sec, the number of bytes read per second for the pool.
  • ceph_pool_size, the number of data replications for the pool.
  • ceph_pool_write_bytes_sec, the number of bytes written per second for the pool.

Placement Groups

  • ceph_pg_bytes_avail, the available size in bytes.
  • ceph_pg_bytes_total, the cluster total size in bytes.
  • ceph_pg_bytes_used, the data stored size in bytes.
  • ceph_pg_data_bytes, the stored data size in bytes before it is replicated, cloned or snapshotted.
  • ceph_pg_state, the number of placement groups in a given state. The metric contains a state field whose <state> value is a combination separated by + of 2 or more states of this list: creating, active, clean, down, replay, splitting, scrubbing, degraded, inconsistent, peering, repair, recovering, recovery_wait, backfill, backfill-wait, backfill_toofull, incomplete, stale, remapped.
  • ceph_pg_total, the total number of placement groups.
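
Because the state field of ceph_pg_state holds a combination such as active+clean, a consumer usually has to split it before counting placement groups per individual state. The following Python sketch shows one way to do that. The function names and the input dictionary are hypothetical.

```python
def split_pg_state(state):
    """Split a combined placement group state into its individual states."""
    return state.split("+")


def count_by_single_state(pg_state_counts):
    """Aggregate ceph_pg_state values per individual state.

    pg_state_counts maps a combined state (the value of the state field)
    to the number of placement groups reported in that combined state.
    """
    totals = {}
    for combined, count in pg_state_counts.items():
        for single in split_pg_state(combined):
            totals[single] = totals.get(single, 0) + count
    return totals


print(count_by_single_state({"active+clean": 120, "active+degraded": 8}))
```

A placement group is counted once for each state it is in, so the per-state totals can add up to more than ceph_pg_total.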

OSD Daemons

  • ceph_osd_down, the number of OSD daemons DOWN.
  • ceph_osd_in, the number of OSD daemons IN.
  • ceph_osd_out, the number of OSD daemons OUT.
  • ceph_osd_up, the number of OSD daemons UP.

The following metrics have an osd field that contains the OSD identifier:

  • ceph_osd_apply_latency, apply latency in ms for the given OSD.
  • ceph_osd_commit_latency, commit latency in ms for the given OSD.
  • ceph_osd_total, the total size in bytes for the given OSD.
  • ceph_osd_used, the data stored size in bytes for the given OSD.

OSD Performance

All the following metrics are retrieved per OSD daemon from the corresponding /var/run/ceph/ceph-osd.<ID>.asok socket by issuing the perf dump command.

All metrics have an osd field that contains the OSD identifier.

Note

These metrics are not collected when a node has both the ceph-osd and controller roles.

For details, see OSD performance counters.

  • ceph_perf_osd_op, the number of client operations.
  • ceph_perf_osd_op_in_bytes, the number of bytes received from clients for write operations.
  • ceph_perf_osd_op_latency, the average latency in ms for client operations (including queue time).
  • ceph_perf_osd_op_out_bytes, the number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_process_latency, the average latency in ms for client operations (excluding queue time).
  • ceph_perf_osd_op_r, the number of client read operations.
  • ceph_perf_osd_op_r_latency, the average latency in ms for read operation (including queue time).
  • ceph_perf_osd_op_r_out_bytes, the number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_r_process_latency, the average latency in ms for read operation (excluding queue time).
  • ceph_perf_osd_op_rw, the number of client read-modify-write operations.
  • ceph_perf_osd_op_rw_in_bytes, the number of bytes per second received from clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_latency, the average latency in ms for read-modify-write operations (including queue time).
  • ceph_perf_osd_op_rw_out_bytes, the number of bytes per second sent to clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_process_latency, the average latency in ms for read-modify-write operations (excluding queue time).
  • ceph_perf_osd_op_rw_rlat, the average latency in ms for read-modify-write operations with readable/applied.
  • ceph_perf_osd_op_w, the number of client write operations.
  • ceph_perf_osd_op_wip, the number of replication operations currently being processed (primary).
  • ceph_perf_osd_op_w_in_bytes, the number of bytes received from clients for write operations.
  • ceph_perf_osd_op_w_latency, the average latency in ms for write operations (including queue time).
  • ceph_perf_osd_op_w_process_latency, the average latency in ms for write operation (excluding queue time).
  • ceph_perf_osd_op_w_rlat, the average latency in ms for write operations with readable/applied.
  • ceph_perf_osd_recovery_ops, the number of recovery operations in progress.

Pacemaker

Resource location

  • pacemaker_resource_local_active, 1 when the resource is located on the host reporting the metric, if not, then 0. The metric contains a resource field which is one of ‘vip__public’, ‘vip__management’, ‘vip__vrouter_pub’, or ‘vip__vrouter’.

Clusters

The cluster metrics are emitted by the GSE plugins. For details, see Configuring alarms.

  • cluster_node_status, the status of the node cluster. The metric contains a cluster_name field that identifies the node cluster.
  • cluster_service_status, the status of the service cluster. The metric contains a cluster_name field that identifies the service cluster.
  • cluster_status, the status of the global cluster. The metric contains a cluster_name field that identifies the global cluster.

The supported values for these metrics are:

  • 0 for the Okay status.
  • 1 for the Warning status.
  • 2 for the Unknown status.
  • 3 for the Critical status.
  • 4 for the Down status.
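
When consuming these metrics, it can help to map the numeric values back to their names. The following Python sketch encodes the table above. The class and function names are hypothetical and are not part of the collector.

```python
from enum import IntEnum


class ClusterStatus(IntEnum):
    """Numeric values used by the cluster status metrics."""
    OKAY = 0
    WARNING = 1
    UNKNOWN = 2
    CRITICAL = 3
    DOWN = 4


def describe(value):
    """Return the status name for a metric value, or 'Invalid' if unsupported."""
    try:
        return ClusterStatus(int(value)).name.capitalize()
    except ValueError:
        return "Invalid"


print(describe(3))
```

Note that the numeric order is not a strict severity scale under every reading: Unknown (2) sits between Warning (1) and Critical (3).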

Self-monitoring

System

The metrics have a service field with the name of the service they apply to. The values can be: hekad, collectd, influxd, grafana-server or elasticsearch.

  • lma_components_count_processes, the number of processes currently running.
  • lma_components_count_threads, the number of threads currently running.
  • lma_components_cputime_syst, the percentage of CPU time spent in system mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_cputime_user, the percentage of CPU time spent in user mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_disk_bytes_read, the number of bytes read from disk(s) per second.
  • lma_components_disk_bytes_write, the number of bytes written to disk(s) per second.
  • lma_components_disk_ops_read, the number of read operations from disk(s) per second.
  • lma_components_disk_ops_write, the number of write operations to disk(s) per second.
  • lma_components_memory_code, the physical memory, in bytes, devoted to executable code.
  • lma_components_memory_data, the physical memory, in bytes, devoted to anything other than executable code.
  • lma_components_memory_rss, the non-swapped physical memory used in bytes.
  • lma_components_memory_vm, the virtual memory size in bytes.
  • lma_components_pagefaults_majflt, the number of major page faults per second.
  • lma_components_pagefaults_minflt, the number of minor page faults per second.
  • lma_components_stacksize, the absolute value of the start address (the bottom) of the stack minus the address of the current stack pointer.
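
Because lma_components_cputime_syst and lma_components_cputime_user can exceed 100% on multi-CPU nodes, it may be convenient to express them as a share of the node's total CPU capacity. The sketch below assumes that the reported value is summed across CPUs and that the number of CPUs on the node is known to the caller. The normalize_cpu_percent function is a hypothetical helper, not something the collector provides.

```python
def normalize_cpu_percent(raw_percent, cpu_count):
    """Scale a per-service CPU percentage to the 0-100 range of the whole node.

    raw_percent: value of lma_components_cputime_syst or _user.
    cpu_count:   number of CPUs on the node (assumed to be known by the caller).
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return raw_percent / cpu_count
```

For example, a service reporting 250% on a 4-CPU node uses 62.5% of the node's total CPU capacity.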

Heka pipeline

The metrics have two fields: name, which contains the name of the decoder or filter as defined by Heka, and type, which is either decoder or filter.

The metrics for both types are as follows:

  • hekad_memory, the total memory in bytes used by the Sandbox.
  • hekad_msg_avg_duration, the average time in nanoseconds for processing a message.
  • hekad_msg_count, the total number of messages processed by the decoder or filter. This resets to 0 when the process is restarted.

Additional metrics for filter type:

  • hekad_timer_event_avg_duration, the average time in nanoseconds for executing the timer_event function.
  • hekad_timer_event_count, the total number of executions of the timer_event function. This resets to 0 when the process is restarted.
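
Counters such as hekad_msg_count and hekad_timer_event_count reset to 0 when the process restarts, so a plain difference between two samples can be negative. The following sketch shows one common way to handle this, assuming that samples are processed in chronological order. The counter_increase and total_increase functions are hypothetical helpers.

```python
def counter_increase(previous, current):
    """Return the increase between two consecutive samples of a counter.

    If the current sample is lower than the previous one, the counter is
    assumed to have been reset to 0 by a restart, so everything counted
    since the restart is treated as the increase.
    """
    if current < previous:
        return current
    return current - previous


def total_increase(samples):
    """Sum the increases over a chronological list of counter samples."""
    return sum(counter_increase(a, b) for a, b in zip(samples, samples[1:]))
```

Note that events counted between the last sample before a restart and the restart itself cannot be recovered this way, so the result is a lower bound.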

Back-end checks

  • http_check, the API status of the back end: 1 if it is responsive, 0 otherwise. The metric contains a service field that identifies the LMA back-end service being checked.

The service field takes one of the following values, depending on which Fuel plugins are deployed in the environment:

  • ‘influxdb’

Elasticsearch

The following metrics summarize the health of the cluster. For details, see Cluster health.

  • elasticsearch_cluster_active_primary_shards, the number of active primary shards.
  • elasticsearch_cluster_active_shards, the number of active shards.
  • elasticsearch_cluster_health, the health status of the entire cluster, where the values 1, 2, and 3 represent green, yellow, and red, respectively. The red status may also be reported when the Elasticsearch API returns an unexpected result, for example, because of a network failure.
  • elasticsearch_cluster_initializing_shards, the number of initializing shards.
  • elasticsearch_cluster_number_of_nodes, the number of nodes in the cluster.
  • elasticsearch_cluster_number_of_pending_tasks, the number of pending tasks.
  • elasticsearch_cluster_relocating_shards, the number of relocating shards.
  • elasticsearch_cluster_unassigned_shards, the number of unassigned shards.
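
The numeric encoding of elasticsearch_cluster_health can be translated back to the Elasticsearch color names with a small lookup. This is a minimal sketch, and the health_label function is a hypothetical helper. Because the red status may also be reported when the API returns an unexpected result, a red value does not necessarily mean that shards are missing.

```python
# Values documented for elasticsearch_cluster_health.
ES_HEALTH_LABELS = {1: "green", 2: "yellow", 3: "red"}


def health_label(value):
    """Return the color name for a raw health value, or 'unknown' otherwise."""
    return ES_HEALTH_LABELS.get(int(value), "unknown")
```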

InfluxDB

The following metrics are extracted from the output of the show stats command. The values are reset to zero when InfluxDB is restarted.

cluster

The following metrics are only available if there is more than one node in the cluster:

  • influxdb_cluster_write_shard_points_requests, the number of requests for writing time series points to a shard.
  • influxdb_cluster_write_shard_requests, the number of requests for writing to a shard.

httpd

  • influxdb_httpd_failed_auths, the number of failed authentications.
  • influxdb_httpd_ping_requests, the number of ping requests.
  • influxdb_httpd_query_requests, the number of query requests received.
  • influxdb_httpd_query_response_bytes, the number of bytes returned to the client.
  • influxdb_httpd_requests, the number of requests received.
  • influxdb_httpd_write_points_ok, the number of points successfully written.
  • influxdb_httpd_write_request_bytes, the number of bytes received for write requests.
  • influxdb_httpd_write_requests, the number of write requests received.

write

  • influxdb_write_local_point_requests, the number of requests to write points from the local data node.
  • influxdb_write_ok, the number of writes that succeeded at the requested consistency level.
  • influxdb_write_point_requests, the number of requests to write points across all data nodes.
  • influxdb_write_remote_point_requests, the number of requests to write points to remote data nodes.
  • influxdb_write_requests, the number of write requests across all data nodes.
  • influxdb_write_sub_ok, the number of successful points sent to subscriptions.

runtime

  • influxdb_garbage_collections, the number of garbage collections.
  • influxdb_go_routines, the number of goroutines.
  • influxdb_heap_idle, the number of bytes in idle spans.
  • influxdb_heap_in_use, the number of bytes in non-idle spans.
  • influxdb_heap_objects, the total number of allocated objects.
  • influxdb_heap_released, the number of bytes released to the operating system.
  • influxdb_heap_system, the number of bytes obtained from the system.
  • influxdb_memory_alloc, the number of bytes allocated and not yet freed.
  • influxdb_memory_frees, the number of free operations.
  • influxdb_memory_lookups, the number of pointer lookups.
  • influxdb_memory_mallocs, the number of malloc operations.
  • influxdb_memory_system, the number of bytes obtained from the system.
  • influxdb_memory_total_alloc, the number of bytes allocated (even if freed).