Appendix B: List of metrics

This is the list of metrics that are emitted by the LMA collector service. They are listed by category then by metric name.

System

CPU

Metrics have a cpu_number field that contains the CPU number to which the metric applies.

  • cpu_idle, percentage of CPU time spent in the idle task.
  • cpu_interrupt, percentage of CPU time spent servicing interrupts.
  • cpu_nice, percentage of CPU time spent in user mode with low priority (nice).
  • cpu_softirq, percentage of CPU time spent servicing soft interrupts.
  • cpu_steal, percentage of CPU time spent in other operating systems.
  • cpu_system, percentage of CPU time spent in system mode.
  • cpu_user, percentage of CPU time spent in user mode.
  • cpu_wait, percentage of CPU time spent waiting for I/O operations to complete.

Disk

Metrics have a device field that contains the name of the disk device the metric applies to (eg ‘sda’, ‘sdb’ and so on).

  • disk_merged_read, the number of read operations per second that could be merged with already queued operations.
  • disk_merged_write, the number of write operations per second that could be merged with already queued operations.
  • disk_octets_read, the number of octets (bytes) read per second.
  • disk_octets_write, the number of octets (bytes) written per second.
  • disk_ops_read, the number of read operations per second.
  • disk_ops_write, the number of write operations per second.
  • disk_time_read, the average time for a read operation to complete in the last interval.
  • disk_time_write, the average time for a write operation to complete in the last interval.

File system

Metrics have a fs field that contains the partition’s mount point to which the metric applies (eg ‘/’, ‘/var/lib’ and so on).

  • fs_inodes_free, the number of free inodes on the file system.
  • fs_inodes_reserved, the number of reserved inodes.
  • fs_inodes_used, the number of used inodes.
  • fs_space_free, the number of free bytes.
  • fs_space_reserved, the number of reserved bytes.
  • fs_space_used, the number of used bytes.
  • fs_inodes_percent_free, the percentage of free inodes on the file system.
  • fs_inodes_percent_reserved, the percentage of reserved inodes.
  • fs_inodes_percent_used, the percentage of used inodes.
  • fs_space_percent_free, the percentage of free bytes.
  • fs_space_percent_reserved, the percentage of reserved bytes.
  • fs_space_percent_used, the percentage of used bytes.
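The percentage metrics can be related to the absolute ones as in the minimal Python sketch below. It assumes that each percentage is the corresponding value divided by the sum of free, reserved and used; this appendix does not state the formula, and the function name is hypothetical.

```python
def fs_space_percentages(free: float, reserved: float, used: float) -> dict:
    """Derive fs_space_percent_* style values from absolute byte counts.

    Assumption: the total is free + reserved + used.
    """
    total = free + reserved + used
    if total == 0:
        # Avoid dividing by zero on an empty or pseudo file system.
        return {"free": 0.0, "reserved": 0.0, "used": 0.0}
    return {
        "free": 100.0 * free / total,
        "reserved": 100.0 * reserved / total,
        "used": 100.0 * used / total,
    }
```

The same arithmetic would apply to the inode metrics, under the same assumption.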

System load

  • load_longterm, the system load average over the last 15 minutes.
  • load_midterm, the system load average over the last 5 minutes.
  • load_shortterm, the system load average over the last minute.

Memory

  • memory_buffered, the amount of memory (in bytes) which is buffered.
  • memory_cached, the amount of memory (in bytes) which is cached.
  • memory_free, the amount of memory (in bytes) which is free.
  • memory_used, the amount of memory (in bytes) which is used.

Network

Metrics have an interface field that contains the interface name the metric applies to (eg ‘eth0’, ‘eth1’ and so on).

  • if_errors_rx, the number of errors per second detected when receiving from the interface.
  • if_errors_tx, the number of errors per second detected when transmitting from the interface.
  • if_octets_rx, the number of octets (bytes) received per second by the interface.
  • if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
  • if_packets_rx, the number of packets received per second by the interface.
  • if_packets_tx, the number of packets transmitted per second by the interface.

Processes

  • processes_fork_rate, the number of processes forked per second.
  • processes_count, the number of processes in a given state. The metric has a state field (one of ‘blocked’, ‘paging’, ‘running’, ‘sleeping’, ‘stopped’ or ‘zombies’).

Swap

  • swap_cached, the amount of cached memory (in bytes) which is in the swap.
  • swap_free, the amount of free memory (in bytes) which is in the swap.
  • swap_used, the amount of used memory (in bytes) which is in the swap.
  • swap_io_in, the number of swap pages written per second.
  • swap_io_out, the number of swap pages read per second.

Users

  • logged_users, the number of users currently logged-in.

Apache

  • apache_bytes, the number of bytes per second transmitted by the server.
  • apache_requests, the number of requests processed per second.
  • apache_connections, the current number of active connections.
  • apache_idle_workers, the current number of idle workers.
  • apache_workers_closing, the number of workers in closing state.
  • apache_workers_dnslookup, the number of workers in DNS lookup state.
  • apache_workers_finishing, the number of workers in finishing state.
  • apache_workers_idle_cleanup, the number of workers in idle cleanup state.
  • apache_workers_keepalive, the number of workers in keepalive state.
  • apache_workers_logging, the number of workers in logging state.
  • apache_workers_open, the number of workers in open state.
  • apache_workers_reading, the number of workers in reading state.
  • apache_workers_sending, the number of workers in sending state.
  • apache_workers_starting, the number of workers in starting state.
  • apache_workers_waiting, the number of workers in waiting state.

MySQL

Commands

mysql_commands, the number of times per second a given statement has been executed. The metric has a command field that contains the statement to which it applies. The values can be:

  • change_db for the USE statement.
  • commit for the COMMIT statement.
  • flush for the FLUSH statement.
  • insert for the INSERT statement.
  • rollback for the ROLLBACK statement.
  • select for the SELECT statement.
  • set_option for the SET statement.
  • show_collations for the SHOW COLLATION statement.
  • show_databases for the SHOW DATABASES statement.
  • show_fields for the SHOW FIELDS statement.
  • show_master_status for the SHOW MASTER STATUS statement.
  • show_status for the SHOW STATUS statement.
  • show_tables for the SHOW TABLES statement.
  • show_variables for the SHOW VARIABLES statement.
  • show_warnings for the SHOW WARNINGS statement.
  • update for the UPDATE statement.

Handlers

mysql_handler, the number of times per second a given handler has been executed. The metric has a handler field that contains the handler it applies to. The values can be:

  • commit for the internal COMMIT statements.
  • delete for the internal DELETE statements.
  • external_lock for the external locks.
  • read_first for the requests that read the first entry in an index.
  • read_key for the requests that read a row based on a key.
  • read_next for the requests that read the next row in key order.
  • read_prev for the requests that read the previous row in key order.
  • read_rnd for the requests that read a row based on a fixed position.
  • read_rnd_next for the requests that read the next row in the data file.
  • rollback for the requests that perform a rollback operation.
  • update for the requests that update a row in a table.
  • write for the requests that insert a row in a table.

Locks

  • mysql_locks_immediate, the number of times per second the requests for table locks could be granted immediately.
  • mysql_locks_waited, the number of times per second the requests for table locks had to wait.

Network

  • mysql_octets_rx, the number of bytes received per second by the server.
  • mysql_octets_tx, the number of bytes sent per second by the server.

Threads

  • mysql_threads_cached, the number of threads in the thread cache.
  • mysql_threads_connected, the number of currently open connections.
  • mysql_threads_running, the number of threads that are not sleeping.
  • mysql_threads_created, the number of threads created per second to handle connections.

Cluster

These metrics are collected with the ‘SHOW STATUS’ statement. See the Percona documentation for further details.

  • mysql_cluster_size, current number of nodes in the cluster.
  • mysql_cluster_status, 1 when the node is ‘Primary’, 2 if ‘Non-Primary’ and 3 if ‘Disconnected’.
  • mysql_cluster_connected, 1 when the node is connected to the cluster, if not 0.
  • mysql_cluster_ready, 1 when the node is ready to accept queries, if not 0.
  • mysql_cluster_local_commits, number of writesets committed on the node.
  • mysql_cluster_received_bytes, total size in bytes of writesets received from other nodes.
  • mysql_cluster_received, total number of writesets received from other nodes.
  • mysql_cluster_replicated_bytes, total size in bytes of writesets sent to other nodes.
  • mysql_cluster_replicated, total number of writesets sent to other nodes.
  • mysql_cluster_local_cert_failures, number of writesets that failed the certification test.
  • mysql_cluster_local_send_queue, the number of writesets waiting to be sent.
  • mysql_cluster_local_recv_queue, the number of writesets waiting to be applied.
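The numeric values of mysql_cluster_status, mysql_cluster_connected and mysql_cluster_ready can be decoded as in the following Python sketch. The mapping comes from the list above; the function name and the rule that a node is "usable" only when all three metrics report a healthy value are assumptions made for illustration.

```python
# Values of mysql_cluster_status as listed above.
CLUSTER_STATUS = {1: "Primary", 2: "Non-Primary", 3: "Disconnected"}


def describe_node(status: int, connected: int, ready: int) -> str:
    """Return a short human-readable summary for one node."""
    label = CLUSTER_STATUS.get(status, "unknown")
    # Assumption: a node is usable when it is Primary, connected and ready.
    usable = status == 1 and connected == 1 and ready == 1
    return f"{label} ({'usable' if usable else 'not usable'})"
```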

Slow Queries

This metric is collected with the statement ‘SHOW STATUS WHERE Variable_name = 'Slow_queries'’.

  • mysql_slow_queries, number of queries that have taken longer than the number of seconds set by the MySQL configuration parameter ‘long_query_time’ (10 seconds by default).

RabbitMQ

Cluster

  • rabbitmq_connections, total number of connections.
  • rabbitmq_consumers, total number of consumers.
  • rabbitmq_exchanges, total number of exchanges.
  • rabbitmq_memory, bytes of memory consumed by the Erlang process associated with all queues, including stack, heap and internal structures.
  • rabbitmq_used_memory, bytes of memory used by the whole RabbitMQ process.
  • rabbitmq_remaining_memory, the difference between rabbitmq_vm_memory_limit and rabbitmq_used_memory.
  • rabbitmq_messages, total number of messages which are ready to be consumed or not yet acknowledged.
  • rabbitmq_total_nodes, total number of nodes in the cluster.
  • rabbitmq_running_nodes, total number of running nodes in the cluster.
  • rabbitmq_queues, total number of queues.
  • rabbitmq_unmirrored_queues, total number of queues that are not mirrored.
  • rabbitmq_vm_memory_limit, the maximum amount of memory allocated for RabbitMQ. When rabbitmq_used_memory uses more than this value, all producers are blocked.
  • rabbitmq_disk_free_limit, the minimum amount of free disk for RabbitMQ. When rabbitmq_disk_free drops below this value, all producers are blocked.
  • rabbitmq_disk_free, the disk free space.
  • rabbitmq_remaining_disk, the difference between rabbitmq_disk_free and rabbitmq_disk_free_limit.
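The two derived metrics, rabbitmq_remaining_memory and rabbitmq_remaining_disk, follow directly from the definitions above. A minimal Python sketch of that arithmetic is shown here; the function name and the producers_blocked key are hypothetical and only restate the blocking conditions described in the list.

```python
def rabbitmq_headroom(vm_memory_limit: int, used_memory: int,
                      disk_free: int, disk_free_limit: int) -> dict:
    """Compute the remaining memory and disk as the differences defined above."""
    remaining_memory = vm_memory_limit - used_memory
    remaining_disk = disk_free - disk_free_limit
    return {
        "rabbitmq_remaining_memory": remaining_memory,
        "rabbitmq_remaining_disk": remaining_disk,
        # Producers are blocked when memory use exceeds the limit or when
        # free disk space drops below its limit, so either value is negative.
        "producers_blocked": remaining_memory < 0 or remaining_disk < 0,
    }
```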

Queues

All metrics have a queue field which contains the name of the RabbitMQ queue.

  • rabbitmq_queue_consumers, number of consumers for a given queue.
  • rabbitmq_queue_memory, bytes of memory consumed by the Erlang process associated with the queue, including stack, heap and internal structures.
  • rabbitmq_queue_messages, number of messages which are ready to be consumed or not yet acknowledged for the given queue.

HAProxy

frontend and backend field values can be:

  • cinder-api
  • glance-api
  • glance-registry-api
  • heat-api
  • heat-cfn-api
  • heat-cloudwatch-api
  • horizon-web (when Horizon is deployed without TLS)
  • horizon-https (when Horizon is deployed with TLS)
  • keystone-public-api
  • keystone-admin-api
  • mysqld-tcp
  • murano-api
  • neutron-api
  • nova-api
  • nova-ec2-api
  • nova-metadata-api
  • nova-novncproxy-websocket
  • sahara-api
  • swift-api

Server

  • haproxy_connections, the number of current connections.
  • haproxy_ssl_connections, the number of current SSL connections.
  • haproxy_pipes_free, the number of free pipes.
  • haproxy_pipes_used, the number of used pipes.
  • haproxy_run_queue, the number of connections waiting in the queue.
  • haproxy_tasks, the number of tasks.
  • haproxy_uptime, the HAProxy server uptime in seconds.

Frontends

  • haproxy_frontend_bytes_in, the total number of bytes received by all frontends.
  • haproxy_frontend_bytes_out, the total number of bytes transmitted by all frontends.
  • haproxy_frontend_session_current, the total number of current sessions for all frontends.

The following metrics have a frontend field that contains the name of the frontend server.

  • haproxy_frontend_bytes_in, the number of bytes received by the frontend.
  • haproxy_frontend_bytes_out, the number of bytes transmitted by the frontend.
  • haproxy_frontend_denied_requests, the number of denied requests.
  • haproxy_frontend_denied_responses, the number of denied responses.
  • haproxy_frontend_error_requests, the number of error requests.
  • haproxy_frontend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_frontend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_frontend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_frontend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_frontend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_frontend_response_other, the number of HTTP responses with other code.
  • haproxy_frontend_session_current, the number of current sessions.
  • haproxy_frontend_session_total, the cumulative number of sessions.

Backends

  • haproxy_backend_bytes_in, the total number of bytes received by all backends.
  • haproxy_backend_bytes_out, the total number of bytes transmitted by all backends.
  • haproxy_backend_queue_current, the total number of requests in queue for all backends.
  • haproxy_backend_session_current, the total number of current sessions for all backends.
  • haproxy_backend_error_responses, the total number of error responses for all backends.

The following metrics have a backend field that contains the name of the backend server.

  • haproxy_backend_bytes_in, the number of bytes received by the backend.
  • haproxy_backend_bytes_out, the number of bytes transmitted by the backend.
  • haproxy_backend_denied_requests, the number of denied requests.
  • haproxy_backend_denied_responses, the number of denied responses.
  • haproxy_backend_downtime, the total downtime in seconds.
  • haproxy_backend_status, the global backend status where values 0 and 1 represent respectively DOWN (all backends are down) and UP (at least one backend is up).
  • haproxy_backend_error_connection, the number of error connections.
  • haproxy_backend_error_responses, the number of error responses.
  • haproxy_backend_queue_current, the number of requests in queue.
  • haproxy_backend_redistributed, the number of times a request was redispatched to another server.
  • haproxy_backend_response_1xx, the number of HTTP responses with 1xx code.
  • haproxy_backend_response_2xx, the number of HTTP responses with 2xx code.
  • haproxy_backend_response_3xx, the number of HTTP responses with 3xx code.
  • haproxy_backend_response_4xx, the number of HTTP responses with 4xx code.
  • haproxy_backend_response_5xx, the number of HTTP responses with 5xx code.
  • haproxy_backend_response_other, the number of HTTP responses with other code.
  • haproxy_backend_retries, the number of times a connection to a server was retried.
  • haproxy_backend_servers, the count of servers grouped by state. This metric has an additional state field that contains the state of the backends (either ‘down’ or ‘up’).
  • haproxy_backend_session_current, the number of current sessions.
  • haproxy_backend_session_total, the cumulative number of sessions.

Memcached

  • memcached_command_flush, cumulative number of flush reqs.
  • memcached_command_get, cumulative number of retrieval reqs.
  • memcached_command_set, cumulative number of storage reqs.
  • memcached_command_touch, cumulative number of touch reqs.
  • memcached_connections_current, number of open connections.
  • memcached_items_current, current number of items stored.
  • memcached_octets_rx, total number of bytes read by this server from network.
  • memcached_octets_tx, total number of bytes sent by this server to network.
  • memcached_ops_decr_hits, number of successful decr reqs.
  • memcached_ops_decr_misses, number of decr reqs against missing keys.
  • memcached_ops_evictions, number of valid items removed from cache to free memory for new items.
  • memcached_ops_hits, number of keys that have been requested and found.
  • memcached_ops_incr_hits, number of successful incr reqs.
  • memcached_ops_incr_misses, number of incr reqs against missing keys.
  • memcached_ops_misses, number of items that have been requested and not found.
  • memcached_df_cache_used, current number of bytes used to store items.
  • memcached_df_cache_free, current number of free bytes to store items.
  • memcached_percent_hitratio, percentage of get command hits (in cache).

See memcached documentation for further details.
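As an illustration of how memcached_percent_hitratio relates to memcached_ops_hits and memcached_ops_misses, the Python sketch below computes a hit ratio from the two counters. The formula (hits divided by hits plus misses) is an assumption; this appendix does not specify how the collector derives the value, and the function name is hypothetical.

```python
def hit_ratio_percent(hits: float, misses: float) -> float:
    """Percentage of lookups that were served from the cache."""
    lookups = hits + misses
    if lookups == 0:
        # No lookups yet: report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * hits / lookups
```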

Libvirt

Every metric contains an instance_id field which is the UUID of the instance for the Nova service.

CPU

  • virt_cpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual instance in a second.
  • virt_vcpu_time, the average amount of CPU time (in nanoseconds) allocated to the virtual CPU in a second. The metric contains a vcpu_number field which is the virtual CPU number.

Disk

Metrics have a device field that contains the virtual disk device to which the metric applies (eg ‘vda’, ‘vdb’ and so on).

  • virt_disk_octets_read, the number of octets (bytes) read per second.
  • virt_disk_octets_write, the number of octets (bytes) written per second.
  • virt_disk_ops_read, the number of read operations per second.
  • virt_disk_ops_write, the number of write operations per second.

Memory

  • virt_memory_total, the total amount of memory (in bytes) allocated to the virtual instance.

Network

Metrics have an interface field that contains the interface name to which the metric applies (eg ‘tap0dc043a6-dd’, ‘tap769b123a-2e’ and so on).

  • virt_if_dropped_rx, the number of dropped packets per second when receiving from the interface.
  • virt_if_dropped_tx, the number of dropped packets per second when transmitting from the interface.
  • virt_if_errors_rx, the number of errors per second detected when receiving from the interface.
  • virt_if_errors_tx, the number of errors per second detected when transmitting from the interface.
  • virt_if_octets_rx, the number of octets (bytes) received per second by the interface.
  • virt_if_octets_tx, the number of octets (bytes) transmitted per second by the interface.
  • virt_if_packets_rx, the number of packets received per second by the interface.
  • virt_if_packets_tx, the number of packets transmitted per second by the interface.

OpenStack

Service checks

  • openstack_check_api, the service’s API status, 1 if it is responsive, if not 0.

    The metric contains a service field that identifies the OpenStack service being checked.

<service> is one of the following values with their respective resource checks:

  • ‘nova-api’: ‘/’
  • ‘cinder-api’: ‘/’
  • ‘cinder-v2-api’: ‘/’
  • ‘glance-api’: ‘/’
  • ‘heat-api’: ‘/’
  • ‘heat-cfn-api’: ‘/’
  • ‘keystone-public-api’: ‘/’
  • ‘neutron-api’: ‘/’
  • ‘ceilometer-api’: ‘/v2/capabilities’
  • ‘swift-api’: ‘/healthcheck’
  • ‘swift-s3-api’: ‘/healthcheck’

Note

All checks are performed without authentication except for Ceilometer.
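The mapping between services and checked resources can be written down as a small lookup table. The Python sketch below only builds the URL that such a check would request; the base URL is a parameter because this appendix does not list hosts or ports, and the function name is hypothetical.

```python
# Resource requested for each service, as listed above.
CHECK_RESOURCES = {
    "nova-api": "/",
    "cinder-api": "/",
    "cinder-v2-api": "/",
    "glance-api": "/",
    "heat-api": "/",
    "heat-cfn-api": "/",
    "keystone-public-api": "/",
    "neutron-api": "/",
    "ceilometer-api": "/v2/capabilities",
    "swift-api": "/healthcheck",
    "swift-s3-api": "/healthcheck",
}


def check_url(base_url: str, service: str) -> str:
    """Join a caller-supplied base URL with the resource for the service."""
    return base_url.rstrip("/") + CHECK_RESOURCES[service]
```

For example, check_url("http://192.0.2.10:8777", "ceilometer-api") yields a URL ending in /v2/capabilities; the address and port here are placeholders.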

Compute

These metrics are emitted per compute node.

  • openstack_nova_instance_creation_time, the time (in seconds) it took to launch a new instance.
  • openstack_nova_instance_state, the number of instances which entered a given state (the value is always 1). The metric contains a state field.
  • openstack_nova_free_disk, the disk space (in GB) available for new instances.
  • openstack_nova_used_disk, the disk space (in GB) used by the instances.
  • openstack_nova_free_ram, the memory (in MB) available for new instances.
  • openstack_nova_used_ram, the memory (in MB) used by the instances.
  • openstack_nova_free_vcpus, the number of virtual CPUs available for new instances.
  • openstack_nova_used_vcpus, the number of virtual CPUs used by the instances.
  • openstack_nova_running_instances, the number of running instances.
  • openstack_nova_running_tasks, the number of tasks currently executed.

These metrics are retrieved from the Nova API and represent the aggregated values across all compute nodes.

  • openstack_nova_total_free_disk, the total amount of disk space (in GB) available for new instances.
  • openstack_nova_total_used_disk, the total amount of disk space (in GB) used by the instances.
  • openstack_nova_total_free_ram, the total amount of memory (in MB) available for new instances.
  • openstack_nova_total_used_ram, the total amount of memory (in MB) used by the instances.
  • openstack_nova_total_free_vcpus, the total number of virtual CPUs available for new instances.
  • openstack_nova_total_used_vcpus, the total number of virtual CPUs used by the instances.
  • openstack_nova_total_running_instances, the total number of running instances.
  • openstack_nova_total_running_tasks, the total number of tasks currently executed.

These metrics are retrieved from the Nova API.

  • openstack_nova_instances, the total count of instances in a given state. The metric contains a state field which is one of ‘active’, ‘deleted’, ‘error’, ‘paused’, ‘resumed’, ‘rescued’, ‘resized’, ‘shelved_offloaded’ or ‘suspended’.

These metrics are retrieved from the Nova database.

  • openstack_nova_services, the total count of Nova services by state. The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
  • openstack_nova_service, the Nova service state (either 0 for ‘up’, 1 for ‘down’ or 2 for ‘disabled’). The metric contains a service field (one of ‘compute’, ‘conductor’, ‘scheduler’, ‘cert’ or ‘consoleauth’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).

Identity

These metrics are retrieved from the Keystone API.

  • openstack_keystone_roles, the total number of roles.
  • openstack_keystone_tenants, the number of tenants by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).
  • openstack_keystone_users, the number of users by state. The metric contains a state field (either ‘enabled’ or ‘disabled’).

Volume

These metrics are emitted per volume node.

  • openstack_cinder_volume_creation_time, the time (in seconds) it took to create a new volume.

Note

When using Ceph as the backend storage for volumes, the hostname value is always set to rbd.

These metrics are retrieved from the Cinder API.

  • openstack_cinder_volumes, the number of volumes by state. The metric contains a state field.
  • openstack_cinder_snapshots, the number of snapshots by state. The metric contains a state field.
  • openstack_cinder_volumes_size, the total size (in bytes) of volumes by state. The metric contains a state field.
  • openstack_cinder_snapshots_size, the total size (in bytes) of snapshots by state. The metric contains a state field.

state is one of ‘available’, ‘creating’, ‘attaching’, ‘in-use’, ‘deleting’, ‘backing-up’, ‘restoring-backup’, ‘error’, ‘error_deleting’, ‘error_restoring’, ‘error_extending’.

These metrics are retrieved from the Cinder database.

  • openstack_cinder_services, the total count of Cinder services by state. The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’) and a state field (one of ‘up’, ‘down’ or ‘disabled’).
  • openstack_cinder_service, the Cinder service state (either 0 for ‘up’, 1 for ‘down’ or 2 for ‘disabled’). The metric contains a service field (one of ‘volume’, ‘backup’, ‘scheduler’), and a state field (one of ‘up’, ‘down’ or ‘disabled’).

Image

These metrics are retrieved from the Glance API.

  • openstack_glance_images, the number of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots, the number of snapshot images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_images_size, the total size (in bytes) of images by state and visibility. The metric contains state and visibility fields.
  • openstack_glance_snapshots_size, the total size (in bytes) of snapshots by state and visibility. The metric contains state and visibility fields.

state is one of ‘queued’, ‘saving’, ‘active’, ‘killed’, ‘deleted’, ‘pending_delete’. visibility is either ‘public’ or ‘private’.

Network

These metrics are retrieved from the Neutron API.

  • openstack_neutron_networks, the number of virtual networks by state. The metric contains a state field.
  • openstack_neutron_subnets, the number of virtual subnets.
  • openstack_neutron_ports, the number of virtual ports by owner and state. The metric contains owner and state fields.
  • openstack_neutron_routers, the number of virtual routers by state. The metric contains a state field.
  • openstack_neutron_floatingips, the total number of floating IP addresses.

<state> is one of ‘active’, ‘build’, ‘down’ or ‘error’.

<owner> is one of ‘compute’, ‘dhcp’, ‘floatingip’, ‘floatingip_agent_gateway’, ‘router_interface’, ‘router_gateway’, ‘router_ha_interface’, ‘router_interface_distributed’ or ‘router_centralized_snat’.

These metrics are retrieved from the Neutron database.

Note

These metrics are not collected when the Contrail plugin is deployed.

  • openstack_neutron_agents, the total number of Neutron agents by service and state. The metric contains service (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’) and state (one of ‘up’, ‘down’ or ‘disabled’) fields.
  • openstack_neutron_agent, the Neutron agent state (either 0 for ‘up’, 1 for ‘down’ or 2 for ‘disabled’). The metric contains a service field (one of ‘dhcp’, ‘l3’, ‘metadata’ or ‘openvswitch’), and a state field (one of ‘up’, ‘down’ or ‘disabled’).

API response times

  • openstack_<service>_http_responses, the time (in seconds) it took to serve the HTTP request. The metric contains http_method (eg ‘GET’, ‘POST’, and so forth) and http_status (eg ‘200’, ‘404’, and so forth) fields.

<service> is one of ‘cinder’, ‘glance’, ‘heat’, ‘keystone’, ‘neutron’ or ‘nova’.
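The placeholder in the metric name expands to one concrete metric per service. A short Python sketch of that expansion follows; the function name is hypothetical and the service list is copied from the line above.

```python
SERVICES = ("cinder", "glance", "heat", "keystone", "neutron", "nova")


def http_response_metric(service: str) -> str:
    """Return the response-time metric name for one OpenStack service."""
    if service not in SERVICES:
        raise ValueError(f"unsupported service: {service}")
    return f"openstack_{service}_http_responses"


# One metric name per supported service.
ALL_HTTP_RESPONSE_METRICS = [http_response_metric(s) for s in SERVICES]
```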

Logs

  • log_messages, the number of log messages per second for the given service and severity level. The metric contains service and severity (one of ‘debug’, ‘info’, ... ) fields.

Ceph

All Ceph metrics have a cluster field containing the name of the Ceph cluster (ceph by default).

See cluster monitoring and RADOS monitoring for further details.

Cluster

  • ceph_health, the health status of the entire cluster where values 1, 2 and 3 represent respectively OK, WARNING and ERROR.
  • ceph_monitor_count, number of ceph-mon processes.
  • ceph_quorum_count, number of ceph-mon processes participating in the quorum.

Pools

  • ceph_pool_total_bytes, total number of bytes for all pools.
  • ceph_pool_total_used_bytes, total used size in bytes by all pools.
  • ceph_pool_total_avail_bytes, total available size in bytes for all pools.
  • ceph_pool_total_number, total number of pools.

The following metrics have a pool field that contains the name of the Ceph pool.

  • ceph_pool_bytes_used, amount of data in bytes used by the pool.
  • ceph_pool_max_avail, available size in bytes for the pool.
  • ceph_pool_objects, number of objects in the pool.
  • ceph_pool_read_bytes_sec, number of bytes read per second for the pool.
  • ceph_pool_write_bytes_sec, number of bytes written per second for the pool.
  • ceph_pool_op_per_sec, number of operations per second for the pool.
  • ceph_pool_size, number of data replications for the pool.
  • ceph_pool_pg_num, number of placement groups for the pool.

Placement Groups

  • ceph_pg_total, total number of placement groups.
  • ceph_pg_bytes_avail, available size in bytes.
  • ceph_pg_bytes_total, cluster total size in bytes.
  • ceph_pg_bytes_used, data stored size in bytes.
  • ceph_pg_data_bytes, stored data size in bytes before it is replicated, cloned or snapshotted.
  • ceph_pg_state, number of placement groups in a given state. The metric contains a state field whose value is a combination, separated by +, of 2 or more states from this list: creating, active, clean, down, replay, splitting, scrubbing, degraded, inconsistent, peering, repair, recovering, recovery_wait, backfill, backfill-wait, backfill_toofull, incomplete, stale, remapped.
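Because the state field of ceph_pg_state is a +-separated combination, a consumer usually needs to split it before filtering. The Python sketch below does so and rejects names that are not in the list given above; the function name is hypothetical, and the set of states is copied from this appendix rather than from Ceph itself.

```python
KNOWN_PG_STATES = {
    "creating", "active", "clean", "down", "replay", "splitting",
    "scrubbing", "degraded", "inconsistent", "peering", "repair",
    "recovering", "recovery_wait", "backfill", "backfill-wait",
    "backfill_toofull", "incomplete", "stale", "remapped",
}


def split_pg_state(state: str) -> list:
    """Split a combined state such as 'active+clean' into its parts."""
    parts = state.split("+")
    unknown = [p for p in parts if p not in KNOWN_PG_STATES]
    if unknown:
        raise ValueError(f"unknown placement group state(s): {unknown}")
    return parts
```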

OSD Daemons

  • ceph_osd_up, number of OSD daemons UP.
  • ceph_osd_down, number of OSD daemons DOWN.
  • ceph_osd_in, number of OSD daemons IN.
  • ceph_osd_out, number of OSD daemons OUT.

The following metrics have an osd field that contains the OSD identifier.

  • ceph_osd_used, data stored size in bytes for the given OSD.
  • ceph_osd_total, total size in bytes for the given OSD.
  • ceph_osd_apply_latency, apply latency in ms for the given OSD.
  • ceph_osd_commit_latency, commit latency in ms for the given OSD.

OSD Performance

All the following metrics are retrieved per OSD daemon from the corresponding socket /var/run/ceph/ceph-osd.<ID>.asok by issuing the command perf dump.

All metrics have an osd field that contains the OSD identifier.

Note

These metrics are not collected when a node has both the ceph-osd and controller roles.

See OSD performance counters for further details.

  • ceph_perf_osd_recovery_ops, number of recovery operations in progress.
  • ceph_perf_osd_op_wip, number of replication operations currently being processed (primary).
  • ceph_perf_osd_op, number of client operations.
  • ceph_perf_osd_op_in_bytes, number of bytes received from clients for write operations.
  • ceph_perf_osd_op_out_bytes, number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_latency, average latency in ms for client operations (including queue time).
  • ceph_perf_osd_op_process_latency, average latency in ms for client operations (excluding queue time).
  • ceph_perf_osd_op_r, number of client read operations.
  • ceph_perf_osd_op_r_out_bytes, number of bytes sent to clients for read operations.
  • ceph_perf_osd_op_r_latency, average latency in ms for read operation (including queue time).
  • ceph_perf_osd_op_r_process_latency, average latency in ms for read operation (excluding queue time).
  • ceph_perf_osd_op_w, number of client write operations.
  • ceph_perf_osd_op_w_in_bytes, number of bytes received from clients for write operations.
  • ceph_perf_osd_op_w_rlat, average latency in ms for write operations with readable/applied.
  • ceph_perf_osd_op_w_latency, average latency in ms for write operations (including queue time).
  • ceph_perf_osd_op_w_process_latency, average latency in ms for write operation (excluding queue time).
  • ceph_perf_osd_op_rw, number of client read-modify-write operations.
  • ceph_perf_osd_op_rw_in_bytes, number of bytes per second received from clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_out_bytes, number of bytes per second sent to clients for read-modify-write operations.
  • ceph_perf_osd_op_rw_rlat, average latency in ms for read-modify-write operations with readable/applied.
  • ceph_perf_osd_op_rw_latency, average latency in ms for read-modify-write operations (including queue time).
  • ceph_perf_osd_op_rw_process_latency, average latency in ms for read-modify-write operations (excluding queue time).

Pacemaker

Resource location

  • pacemaker_resource_local_active, 1 when the resource is located on the host reporting the metric, if not 0. The metric contains a resource field which is one of ‘vip__public’, ‘vip__management’, ‘vip__vrouter_pub’ or ‘vip__vrouter’.

Clusters

The cluster metrics are emitted by the GSE plugins (See the Alarms Configuration Guide for details).

  • cluster_service_status, the status of the service cluster. The metric contains a cluster_name field that identifies the service cluster.
  • cluster_node_status, the status of the node cluster. The metric contains a cluster_name field that identifies the node cluster.
  • cluster_status, the status of the global cluster. The metric contains a cluster_name field that identifies the global cluster.

The supported values for these metrics are:

  • 0 for the Okay status.
  • 1 for the Warning status.
  • 2 for the Unknown status.
  • 3 for the Critical status.
  • 4 for the Down status.
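
When post-processing these metrics, the mapping above can be kept in a small lookup table. The following Python sketch is illustrative only: the names CLUSTER_STATUS and status_label are hypothetical and are not part of the LMA collector, and it assumes that values may arrive as floats from the time-series store.

```python
# Status codes as listed above for cluster_service_status,
# cluster_node_status and cluster_status.
CLUSTER_STATUS = {
    0: "Okay",
    1: "Warning",
    2: "Unknown",
    3: "Critical",
    4: "Down",
}


def status_label(value):
    """Return the status name for a metric value (hypothetical helper)."""
    # Assumption: the value may be stored as a float such as 3.0,
    # so it is converted to an integer before the lookup.
    code = int(value)
    return CLUSTER_STATUS.get(code, "undefined ({})".format(code))
```

For example, status_label(3.0) gives "Critical", and a value outside the documented range is reported as undefined rather than raising an error.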

LMA self-monitoring

System

Metrics have a service field that contains the name of the service to which the metric applies. Values can be: hekad, collectd, influxd, grafana-server or elasticsearch.

  • lma_components_count_processes, number of processes currently running.
  • lma_components_count_threads, number of threads currently running.
  • lma_components_cputime_user, percentage of CPU time spent in user mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_cputime_syst, percentage of CPU time spent in system mode by the service. It can be greater than 100% when the node has more than one CPU.
  • lma_components_disk_bytes_read, number of bytes read from disk(s) per second.
  • lma_components_disk_bytes_write, number of bytes written to disk(s) per second.
  • lma_components_disk_ops_read, number of read operations from disk(s) per second.
  • lma_components_disk_ops_write, number of write operations to disk(s) per second.
  • lma_components_memory_code, physical memory devoted to executable code (bytes).
  • lma_components_memory_data, physical memory devoted to other than executable code (bytes).
  • lma_components_memory_rss, non-swapped physical memory used (bytes).
  • lma_components_memory_vm, virtual memory size (bytes).
  • lma_components_pagefaults_minflt, minor page faults per second.
  • lma_components_pagefaults_majflt, major page faults per second.
  • lma_components_stacksize, absolute value of the start address (the bottom) of the stack minus the address of the current stack pointer.
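
Since lma_components_cputime_user and lma_components_cputime_syst can exceed 100% on multi-CPU nodes, a consumer may want to express them as a share of the node's total CPU capacity. A minimal sketch, assuming the percentages are relative to a single CPU and that the CPU count of the node is known to the consumer (the function name node_cpu_share is hypothetical):

```python
def node_cpu_share(cputime_percent, cpu_count):
    """Scale a per-service CPU percentage to the whole node.

    Assumption: cputime_percent is relative to one CPU, so a service
    keeping two CPUs fully busy reports 200.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return cputime_percent / cpu_count
```

With this scaling, a service reporting 150% on a node with 4 CPUs uses 37.5% of the node's CPU capacity.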

Heka pipeline

Metrics have two fields: name that contains the name of the decoder or filter as defined by Heka and type that is either decoder or filter.

Metrics for both types:

  • hekad_msg_avg_duration, the average time for processing a message (in nanoseconds).
  • hekad_msg_count, the total number of messages processed by the decoder or filter. This will reset to 0 when the process is restarted.
  • hekad_memory, the total memory used by the Sandbox (in bytes).

Additional metrics for filter type:

  • hekad_timer_event_avg_duration, the average time for executing the timer_event function (in nanoseconds).
  • hekad_timer_event_count, the total number of executions of the timer_event function. This will reset to 0 when the process is restarted.
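
Because hekad_msg_count and hekad_timer_event_count are cumulative totals that restart from 0, computing the increase between two samples needs to account for a restart. The sketch below shows one common way to do this; the function name counter_delta is hypothetical, and the reset handling is an assumption about how a consumer might treat these values, not something the collector does itself.

```python
def counter_delta(previous, current):
    """Return the increase of a cumulative counter between two samples.

    If the counter went backwards, assume the process restarted and the
    counter began again from 0, so the current value is the increase
    observed since the restart.
    """
    if current < previous:
        return current
    return current - previous
```

Events counted between the last sample and the restart cannot be recovered this way, so the result is a lower bound in that case.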

Backend checks

  • http_check, the backend’s API status: 1 if it is responsive, 0 otherwise. The metric contains a service field that identifies the LMA backend service being checked.

The service field takes one of the following values (depending on which Fuel plugins are deployed in the environment):

  • ‘influxdb’

Elasticsearch

The following metrics give a simple view of the health of the cluster. See cluster health for further details.

  • elasticsearch_cluster_health, the health status of the entire cluster, where the values 1, 2 and 3 represent green, yellow and red respectively. The red status may also be reported when the Elasticsearch API returns an unexpected result (a network failure, for instance).
  • elasticsearch_cluster_active_primary_shards, the number of active primary shards.
  • elasticsearch_cluster_active_shards, the number of active shards.
  • elasticsearch_cluster_initializing_shards, the number of initializing shards.
  • elasticsearch_cluster_number_of_nodes, the number of nodes in the cluster.
  • elasticsearch_cluster_number_of_pending_tasks, the number of pending tasks.
  • elasticsearch_cluster_relocating_shards, the number of relocating shards.
  • elasticsearch_cluster_unassigned_shards, the number of unassigned shards.
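
The encoding of elasticsearch_cluster_health described above can be illustrated with a short Python sketch. It assumes the input is the status string returned by the Elasticsearch cluster health API ("green", "yellow" or "red"); the names HEALTH_VALUES and health_to_value are hypothetical, and this is not the collector's actual code.

```python
# Numeric encoding described above: 1 = green, 2 = yellow, 3 = red.
HEALTH_VALUES = {"green": 1, "yellow": 2, "red": 3}


def health_to_value(status):
    """Map a cluster health status string to its metric value."""
    # Anything that is not a known colour (missing field, error payload,
    # failed request) is reported as red, mirroring the note above about
    # unexpected API results.
    if not isinstance(status, str):
        return HEALTH_VALUES["red"]
    return HEALTH_VALUES.get(status, HEALTH_VALUES["red"])
```

Treating unknown input as red means that a value of 3 does not by itself distinguish a red cluster from an unreachable API.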

InfluxDB

The following metrics are extracted from the output of the show stats command. The values are reset to zero when InfluxDB is restarted.

cluster

These metrics are only available if there is more than one node in the cluster.

  • influxdb_cluster_write_shard_points_requests, the number of requests for writing time series points to a shard.
  • influxdb_cluster_write_shard_requests, the number of requests for writing to a shard.

httpd

  • influxdb_httpd_failed_auths, the number of failed authentications.
  • influxdb_httpd_ping_requests, the number of ping requests.
  • influxdb_httpd_write_points_ok, the number of points successfully written.
  • influxdb_httpd_query_requests, the number of query requests received.
  • influxdb_httpd_query_response_bytes, the number of bytes returned to the client.
  • influxdb_httpd_requests, the number of requests received.
  • influxdb_httpd_write_requests, the number of write requests received.
  • influxdb_httpd_write_request_bytes, the number of bytes received for write requests.

write

  • influxdb_write_point_requests, the number of write points requests across all data nodes.
  • influxdb_write_local_point_requests, the number of write points requests from the local data node.
  • influxdb_write_remote_point_requests, the number of write points requests to remote data nodes.
  • influxdb_write_requests, the number of write requests across all data nodes.
  • influxdb_write_sub_ok, the number of points successfully sent to subscriptions.
  • influxdb_write_ok, the number of writes that succeeded at the requested consistency level.

runtime

  • influxdb_memory_alloc, the number of bytes allocated and not yet freed.
  • influxdb_memory_total_alloc, the number of bytes allocated (even if freed).
  • influxdb_memory_system, the number of bytes obtained from the system.
  • influxdb_memory_lookups, the number of pointer lookups.
  • influxdb_memory_mallocs, the number of malloc operations.
  • influxdb_memory_frees, the number of free operations.
  • influxdb_heap_idle, the number of bytes in idle spans.
  • influxdb_heap_in_use, the number of bytes in non-idle spans.
  • influxdb_heap_objects, the total number of allocated objects.
  • influxdb_heap_released, the number of bytes released to the operating system.
  • influxdb_heap_system, the number of bytes obtained from the system.
  • influxdb_garbage_collections, the number of garbage collections.
  • influxdb_go_routines, the number of goroutines.