LHC Tunnel

Monday, 23 March 2015

Not all cores are created equal

Within CERN's compute cloud, the hypervisors vary significantly in performance. We generally run the servers for around 5 years before retirement and there are around 3 different configurations selected each year through public procurement.

Benchmarking in High Energy Physics is done using a benchmark suite called HEPSpec 2006 (HS06). This is based on the C++ programs within the SPEC CPU2006 suite, run in parallel according to the number of cores in the server. The performance range is around a factor of 3 between the slowest and the fastest machines [1].


When machines are evaluated after delivery, the HS06 rating for each hardware configuration is saved into a hardware inventory database.

Defining a flavor for each hardware type was not attractive, as there are 15 different configurations to consider and users would not easily find out which flavors have free cores. Instead, users ask for the standard flavors, such as the 1-core m1.small virtual machine, and may land on a hypervisor giving 6 HS06 per core or on one giving 16. However, accounting and quotas are done in virtual cores, so the 6 and 16 HS06 virtual cores are considered equivalent.
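To make the gap concrete, here is a minimal Python sketch of what normalised accounting would report for the two extremes quoted above. The function name hs06_hours and the 24-hour run are illustrative assumptions, not part of the CERN accounting code.

```python
def hs06_hours(vcpus, hours, hs06_per_core):
    """Return the delivered work in HS06-hours (hypothetical helper)."""
    return vcpus * hours * hs06_per_core


# The same 1-core VM running for one day on the slowest and fastest hardware.
slow = hs06_hours(vcpus=1, hours=24, hs06_per_core=6)
fast = hs06_hours(vcpus=1, hours=24, hs06_per_core=16)

# Core-based accounting charges both as 24 core-hours,
# although the delivered work differs by a factor of about 2.7.
print(slow, fast)
```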

In order to improve our accounting, we therefore wanted to provide the performance of the VM along with the metering records giving the CPU usage through ceilometer. Initially, we thought that this would require additional code in ceilometer, but it is actually possible using the standard ceilometer functions with transformers and publishers.


The following approach was implemented.
  • On the hypervisor, we added an additional meter 'hs06' which provides the CPU rating of the VM normalised by the HS06 performance of the hypervisor. This value is determined using the HS06 value stored in the hardware inventory database, which can be provided to the hypervisor via a Puppet fact.
  • This data is stored in ceilometer in addition to the default 'cpu' record.
The benefits of this approach are
  • There is no need for external lookup to the hardware database to process the accounting
  • No additional rights are required for the accounting process (such as the right to read the mapping between VM and hypervisor)
  • Scenarios such as live migration of VMs from one hypervisor to another of different HS06 are correctly handled
  • No modifications to the ceilometer upstream code are required which both improves deployment time and does not invalidate upstream testing
  • Multiple benchmarks can be run concurrently. This allows a smooth migration from HS06 to a successor benchmark (HS14) by providing both sets of data.
  • Standard ceilometer behaviour is not modified so existing programs such as Heat which use this data can continue to run
  • The information is calculated directly on the hypervisor, so it is scalable, and it is calculated inline, which avoids race conditions when the virtual machine is deleted and the VM-to-hypervisor mapping is therefore no longer available
The assumptions are
  • There is no overcommitment of CPU. Further enhancements to the configuration would be possible in this area but would require further meters.
  • The accounting is based on the clock ticks delivered to the hypervisor. The delivered performance will vary where the hypervisor runs a more recent version of the operating system with a later compiler (and thus probably has a higher HS06 rating); running older OS versions is therefore correspondingly less efficient.
  • The cloud is running at least the Juno OpenStack release
To implement this feature, the pipeline capabilities of ceilometer are used. These are configured automatically by the puppet-ceilometer component into /etc/ceilometer/pipeline.yaml.
The required changes are in several blocks. The first is in the sources section, which is introduced by
---
sources:
A further source needs to be defined to make the CPU metric available for transformation. This polls the CPU meter every 10 minutes (600 seconds) and sends the data to the hs06 sink.
    - name: hs06_source
      interval: 600
      meters:
          - "cpu"
      sinks:
          - hs06_sink
The hs06_sink processing is defined later in the file in the sinks section
sinks:
The entry below takes the number of virtual cores of the VM and scales it by 10 (the example HS06 CPU performance per core) and by 0.98 (the virtualisation overhead factor). The result is reported in units of HS06 (i.e. HEPSpec 2006). The value of 10 would be derived from the Puppet HS06 value for the machine divided by the number of cores in the server (from the Puppet fact processorcount). Puppet can be used to configure a hard-coded value per hypervisor that is delivered to the machine as a fact and used to generate the pipeline.yaml configuration file.
    - name: hs06_sink
      transformers:
          - name: "arithmetic"
            parameters:
                target:
                    name: "hs06"
                    unit: "HS06"
                    type: "gauge"
                    expr: "$(cpu).resource_metadata.vcpus*10*0.98"
      publishers:
          - notifier://
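The derivation of the per-core factor in the expression above can be sketched in Python as follows. This is only an illustration of the arithmetic: the function names are hypothetical, and the values of 320 HS06 and 32 cores are assumed inputs chosen to reproduce the example factor of 10. In production the same calculation would be done by Puppet when templating pipeline.yaml.

```python
def hs06_per_core(hs06_total, processorcount):
    """HS06 rating of one physical core of the hypervisor."""
    return hs06_total / processorcount


def hs06_expr(hs06_total, processorcount, overhead=0.98):
    """Build the arithmetic transformer expression for one hypervisor.

    hs06_total and processorcount stand in for the Puppet facts;
    overhead is the virtualisation overhead factor from the text.
    """
    per_core = hs06_per_core(hs06_total, processorcount)
    return "$(cpu).resource_metadata.vcpus*{:g}*{:g}".format(per_core, overhead)


# Assumed example: a 32-core hypervisor rated at 320 HS06 in total.
print(hs06_expr(320, 32))
```

With these inputs the generated string matches the expr line shown in the sink definition.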
Once these changes have been made, the ceilometer compute agent can be restarted to pick up the new configuration.
 service openstack-ceilometer-compute restart
If there are errors, they will be written to /var/log/ceilometer/compute.log. The log can be checked with
egrep "(ERROR|WARNING)" /var/log/ceilometer/compute.log
Initial messages such as "dropping sample with no predecessor" are to be expected: they come from transformers that compute differences between the previous value and the current one (such as cpu utilisation) and have no previous value on the first poll.
After 10 minutes or so, ceilometer will have polled the CPU meter and generated the new hs06 value, which can be queried using the ceilometer CLI. The output of
ceilometer meter-list | grep hs06
will include the hs06 meter
| hs06                                | cumulative | HS06        | c6af7651-5fc5-4d37-bf57-c85238ee098c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
| hs06                                | cumulative | HS06        | e607bece-d9df-4792-904a-3c4adca1b99c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
The last 5 entries in the database can be retrieved with
ceilometer sample-list -m hs06 -l 5
which produces the output
+--------------------------------------+------+-------+--------+------+---------------------+
| Resource ID                          | Name | Type  | Volume | Unit | Timestamp           |
+--------------------------------------+------+-------+--------+------+---------------------+
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:19:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:16:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:13:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:10:49 |
| b812c69c-3c9f-4146-952e-078a266b11c5 | hs06 | gauge | 11.0   | HS06 | 2015-03-22T08:54:25 |
+--------------------------------------+------+-------+--------+------+---------------------+
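For quick checks, the table printed by the CLI can be post-processed with a few lines of Python. The sketch below is not part of ceilometer; the helper names are hypothetical, and it simply parses the text table and averages the hs06 volume per resource, using three of the rows shown above as input.

```python
from collections import defaultdict

SAMPLE = """
+--------------------------------------+------+-------+--------+------+---------------------+
| Resource ID                          | Name | Type  | Volume | Unit | Timestamp           |
+--------------------------------------+------+-------+--------+------+---------------------+
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:19:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:16:49 |
| b812c69c-3c9f-4146-952e-078a266b11c5 | hs06 | gauge | 11.0   | HS06 | 2015-03-22T08:54:25 |
+--------------------------------------+------+-------+--------+------+---------------------+
"""


def parse_cli_table(text):
    """Turn a '|'-delimited CLI table into a list of dicts keyed by header."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ border lines and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def mean_volume_by_resource(rows):
    """Average the Volume column for each Resource ID."""
    volumes = defaultdict(list)
    for row in rows:
        volumes[row["Resource ID"]].append(float(row["Volume"]))
    return {rid: sum(v) / len(v) for rid, v in volumes.items()}


print(mean_volume_by_resource(parse_cli_table(SAMPLE)))
```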

References

  1. Ulrich Schwickerath, "VM benchmarking: update on CERN approach" - http://indico.cern.ch/event/319819/session/1/contribution/7/material/slides/0.pdf
  2. Ceilometer architecture - http://docs.openstack.org/developer/ceilometer/architecture.html
  3. Basic introduction to ceilometer using RDO - https://www.rdoproject.org/CeilometerQuickStart
  4. Ceilometer configuration guide for transformers - http://docs.openstack.org/admin-guide-cloud/content/section_telemetry-pipeline-configuration.html
  5. Ceilometer arithmetic guide - https://github.com/openstack/ceilometer-specs/blob/master/specs/juno/arithmetic-transformer.rst

Saturday, 21 March 2015

Nova quota usage - synchronization

Nova quota usage frequently gets out of sync with the real resource consumption.
We have been hitting this problem for a couple of releases, and it is getting worse as the number of users/tenants in the CERN Cloud Infrastructure grows.

In nova there are two configuration options (“max_age” and “until_refresh”) that define when the quota usage should be refreshed. In our case we have configured them with “-1”, which means the quota usage must be refreshed every time the “_is_quota_refresh_needed” method is called.
For more information about these options you can see a great blog post by Mike Dorman at http://t.co/Q5X1hTgJG1

This worked well in the releases before Havana: when the quota got out of sync, it was refreshed the next time a tenant user performed an operation (e.g. create/delete/…).
However, in Havana with the introduction of “user quotas” (https://wiki.openstack.org/wiki/ReleaseNotes/Havana#Quota) this problem started to be more frequent even when forcing the quota to refresh every time.

In the CERN Cloud Infrastructure a tenant usually has several users. When a user creates/deletes/… an instance and the quota gets out of sync, it affects all users in the tenant. The quota refresh only updates the resources of the user performing the operation, not all tenant resources. This means that the tenant's quota usage is only fixed when the user who owns the out-of-sync resource performs an operation.
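The effect can be illustrated with a toy model in Python. This is not nova code: the data layout, function names and numbers are assumptions made only to show why a per-user refresh leaves the tenant total wrong.

```python
def refresh_user(recorded, actual, project, user):
    """Recompute only one user's usage record, leaving the others untouched."""
    updated = dict(recorded)
    updated[(project, user)] = actual[(project, user)]
    return updated


def project_usage(usage, project):
    """Tenant usage is the sum of the per-user usage records."""
    return sum(v for (p, _), v in usage.items() if p == project)


# Real instance counts, and recorded usage where bob's record has drifted.
actual = {("p", "alice"): 3, ("p", "bob"): 2}
recorded = {("p", "alice"): 3, ("p", "bob"): 4}

# alice performs an operation: only her record is refreshed.
after_alice = refresh_user(recorded, actual, "p", "alice")
print(project_usage(after_alice, "p"))  # tenant total is still wrong

# Only when bob performs an operation is the tenant total corrected.
after_bob = refresh_user(after_alice, actual, "p", "bob")
print(project_usage(after_bob, "p"))
```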

The source of the quota desync is very difficult to reproduce. In fact, all our attempts to reproduce it consistently have failed.
In order to fix the quota usage, the operator needs to manually calculate the quota that is in use and update the database. This process is cumbersome and time consuming, and it can introduce even more inconsistencies into the database.

In order to improve our operations we developed a small tool to check which quotas are out of sync and fix them if necessary.
The tool is available on the CERN Operations GitHub at: https://github.com/cernops/nova-quota-sync

How to use it?

usage: nova-quota-sync [-h] [--all] [--no_sync] [--auto_sync]
                       [--project_id PROJECT_ID] [--config CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  --all                 show the state of all quota resources
  --no_sync             don't perform any synchronization of the mismatch
                        resources
  --auto_sync           automatically sync all resources (no interactive)
  --project_id PROJECT_ID
                        searches only project ID

  --config CONFIG       configuration file

The tool calculates the resources in use and compares them with the quota usages.
For example, to see all resources in quota usages that are out of sync:

# nova-quota-sync --no_sync

+-------------+----------+--------------+----------------+----------------------+----------+
| Project ID  | User ID  |  Instances   |     Cores      |         Ram          |  Status  |
+-------------+----------+--------------+----------------+----------------------+----------+
| 58ed2d48... | user_a   |  657 -> 650  |  2628 -> 2600  |  5382144 -> 5324800  | Mismatch |
| 6f999252... | user_b   |    9 -> 8    |    13 -> 11    |    25088 -> 20992    | Mismatch |
| 79d8d0a2... | user_c   |  232 -> 231  |  5568 -> 5544  |  7424000 -> 7392000  | Mismatch |
| 827441b0... | user_d   |   42 -> 41   |    56 -> 55    |   114688 -> 112640   | Mismatch |
| 8a5858da... | user_e   |    2 -> 4    |     2 -> 4     |     1024 -> 2048     | Mismatch |
+-------------+----------+--------------+----------------+----------------------+----------+
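The comparison behind such a report can be sketched in Python as below. This is not the code of nova-quota-sync: the function names and data layout are hypothetical, and the sketch assumes that the left-hand value in each cell is the recorded quota usage and the right-hand value is the usage computed from the existing instances. The example input mirrors the user_e row, assuming four 1-core instances with 512 MB each.

```python
from collections import Counter

RESOURCES = ("instances", "cores", "ram")


def compute_usage(instances):
    """Sum instances, cores and RAM per (project, user) from instance records."""
    usage = {}
    for inst in instances:
        totals = usage.setdefault((inst["project_id"], inst["user_id"]), Counter())
        totals["instances"] += 1
        totals["cores"] += inst["vcpus"]
        totals["ram"] += inst["memory_mb"]
    return usage


def find_mismatches(recorded, instances):
    """Return {(project, user): {resource: (recorded, computed)}} for differences."""
    actual = compute_usage(instances)
    mismatches = {}
    for key in set(recorded) | set(actual):
        rec, act = recorded.get(key, {}), actual.get(key, {})
        diff = {r: (rec.get(r, 0), act.get(r, 0))
                for r in RESOURCES if rec.get(r, 0) != act.get(r, 0)}
        if diff:
            mismatches[key] = diff
    return mismatches


recorded = {("8a5858da...", "user_e"): {"instances": 2, "cores": 2, "ram": 1024}}
instances = [{"project_id": "8a5858da...", "user_id": "user_e",
              "vcpus": 1, "memory_mb": 512}] * 4
print(find_mismatches(recorded, instances))
```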

The quota usage synchronization can be performed interactively per tenant/project (by omitting the --no_sync argument) or automatically for all “mismatch” resources with the --auto_sync argument.

This tool needs access to the nova database. The database endpoint should be defined in the configuration file (which can be nova.conf). Since the tool both reads and updates the database, be extremely careful when using it.

Note that quota reservations are not considered in the calculations or updated.