At CERN, we developed a fabric monitoring tool called Lemon around 10 years ago while the computing infrastructure for the Large Hadron Collider was being commissioned.
As in many areas, modern open source tools now provide equivalent functionality without the effort of maintaining the entire software stack, while benefiting from the experience of others.
As part of the ongoing work to align the CERN computing fabric management with open source tools that have strong community support, we have investigated a monitoring tool chain that would allow us to consolidate logs, perform trend analysis and provide dashboards to assist in problem determination.
Given our scalability objectives, we can expect O(100K) endpoints with multiple data sources, from PDUs to physical machines and guest instances, to be correlated.
A team looking into a CERN IT reference architecture has divided the problem into four areas:
- Transport - how to get the data from the client machine to where it can be analysed, with the aim of a low overhead per client. Flume was chosen as the implementation for this part after a comparison with ActiveMQ (which we use in other areas such as monitoring notifications), largely because of the many pre-existing connectors for sources and sinks. A minimal sketch of shipping a log event over this transport follows the list below.
- Tagging - where we can associate additional external metadata with the data so that each entry is self-describing. Elasticsearch was the selection here.
- Dashboard - where queries on criteria are displayed in an intuitive way (such as time series or pie charts) on a service-specific dashboard. Kibana was the natural choice given its ease of integration with Elasticsearch.
- Long term repository - to allow offline analysis and trending when new areas of investigation are needed. HDFS was the clear choice for unstructured data, and a number of other projects at CERN require its functionality too.
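To make the transport and tagging steps concrete, here is a minimal sketch of how a client could ship a tagged log line to a Flume agent exposing an HTTP source with the default JSON handler. The endpoint address, port and header names are illustrative assumptions, not our production configuration.

```python
import socket
import requests

# Hypothetical Flume agent exposing an HTTPSource with the default JSONHandler.
FLUME_ENDPOINT = 'http://flume-agent.example.org:5140'

def ship(log_line, service, loglevel='INFO'):
    """Send one log line as a Flume event: the headers carry the tagging
    metadata, the body carries the raw line."""
    event = [{
        'headers': {
            'hostname': socket.getfqdn(),
            'service': service,      # e.g. 'nova-api'
            'loglevel': loglevel,
        },
        'body': log_line,
    }]
    # The JSON handler expects a JSON array of events in the request body.
    requests.post(FLUME_ENDPOINT, json=event, timeout=5).raise_for_status()

if __name__ == '__main__':
    ship('GET /v2/servers HTTP/1.1 status: 200', service='nova-api')
```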
Using this architecture, we have implemented the CERN private cloud log handling, taking the logs from the 700 hypervisors and controllers around the CERN cloud and consolidating them into a set of dashboards.
Our requirements are:
- Have a centralized copy of our logs to ease problem investigation
- Display OpenStack usage statistics in management dashboards
- Show the results of functional tests and probes against the production cloud
- Maintain a long term history of the infrastructure status and evolution
- Monitor the state of our databases
The Elasticsearch configuration is a 14 node cluster with 11 data nodes and 3 HTTP nodes, configured using Puppet and running, naturally, on VMs in the CERN private cloud. We have 5 shards per index, with 2 replicas per shard.
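As an illustration of that layout, the sketch below creates a daily log index with 5 primary shards and 2 replicas per shard through the Elasticsearch REST API. The node address and index name are assumptions for the example only.

```python
import requests

ES_URL = 'http://es-http-node.example.org:9200'  # hypothetical HTTP node

# Index settings matching the layout described above:
# 5 primary shards, each with 2 replica copies.
settings = {
    'settings': {
        'number_of_shards': 5,
        'number_of_replicas': 2,
    }
}

# Create (for example) a daily index for the cloud logs.
response = requests.put('%s/cloud-logs-2013.11.19' % ES_URL, json=settings)
response.raise_for_status()
```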
Kibana is running on 3 nodes, with Shibboleth authentication to integrate with the CERN Single Sign On system.
A number of dashboards have been created. A Nova API dashboard shows the usage of the compute API.
An Active User dashboard helps us to identify heavy users who may be causing disturbance.
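The query behind such a panel might look like the following sketch, which asks Elasticsearch for the top API callers over the last 24 hours. The field names, index pattern and node address are assumptions, and the aggregations syntax shown requires an Elasticsearch release that supports it.

```python
import requests

ES_URL = 'http://es-http-node.example.org:9200'  # hypothetical HTTP node

# Top ten API callers in the last 24 hours: the kind of result an
# "Active User" panel would chart.
query = {
    'size': 0,
    'query': {
        'range': {'@timestamp': {'gte': 'now-24h'}}
    },
    'aggs': {
        'heavy_users': {'terms': {'field': 'user_id', 'size': 10}}
    },
}

response = requests.post('%s/cloud-logs-*/_search' % ES_URL, json=query)
response.raise_for_status()
for bucket in response.json()['aggregations']['heavy_users']['buckets']:
    print(bucket['key'], bucket['doc_count'])
```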
The dashboards themselves can be created dynamically without needing an in-depth knowledge of the monitoring infrastructure.
This work has been implemented by the CERN monitoring and cloud teams, with thanks to the Flume, Elasticsearch and Kibana communities for their support.