Thursday, 21 May 2015

Juno, EL6 and RDO Community.

Since 2011, CERN selected OpenStack as a cloud platform. It was natural to choose RDO as our RPMs provider ; RDO is a community of people using and deploying OpenStack on Red Hat Enterprise Linux, Fedora and distributions derived from these (such as Scientific Linux CERN 6 which powers our hypervisors).

The community decided not to provide official upgrade path from Icehouse to Juno on el6 systems.

While our internal infrastructure is now moving to CentOS 7, we have to maintain during the transition around 2500 compute nodes under SLC6.

As it was mentioned in the previous blog post, we recently finished the migration from IceHouse to Juno. Part of this effort was to rebuild Juno RDO packages for RHEL6 derivative and provide a tested upgrade path from IceHouse.

We are happy to announce that we recompiled openstack-nova and openstack-ceilometer packages publicly with the help of the CentOS infrastructure and made them available to the community.

The effort is led by the CentOS Cloud SIG and I'd like to thank particularly Alan Pevec, Haïkel Guemar and Karanbir Singh for their support and time.

For all the information and how to use the Juno EL6 packages please follow this link

Tuesday, 12 May 2015

Our cloud in Juno

This blog continues our series around upgrades of OpenStack. Previous upgrades are documented in
At CERN, we do incremental upgrades of the cloud, component by component. Giving details of the problems we encounter along the way.

For Juno, we followed the pattern as previously
  • cinder
  • glance
  • keystone
  • ceilometer
  • nova
  • horizon
As we are now rolling out our CentOS 7 based controllers, we took the opportunity to do that upgrade also. Many of the controllers themselves are virtualised which allows us to scale out as needed. An HA proxy configuration allows us to switch rapidly between the services on the different levels.

The motivation to move to CentOS 7 comes from two primary sources
  • CERN is moving from its own distribution, Scientific Linux CERN, to CentOS 7.
  • The RDO packages are being produced on CentOS 7 now. This means that we can benefit from community testing if we are also on that version.
We'll give more details on the SLC6 environment in a future posting.

We encountered one problem during the upgrade. The LDAP backend for roles compared the definition with an upper case role definition. We were using lower case roles in LDAP. This was resolved with a quick workaround and a bug will be reported around

Other than that, the upgrade proceeded smoothly and we're now looking forward to deploying Heat and starting the planning for the migration to Kilo.

Tuesday, 5 May 2015

Purging Nova databases in a cell environment

CERN cloud infrastructure is running in production since July 2013 and during almost 2 years more than 1,000,000 production VMs were created in the infrastructure.

During the last few months, we have had an average of 11,000 VMs running with a creation/deletion rate between 100 and 200 VMs per hour.

Number of VMs created (cumulative)

Difference between VMs created and deleted per month (cumulative)

The information of all these instances is stored in the database and remains when the instances are deleted because nova uses “soft” delete (mark the records as deleted without removing them). As a consequence, the database size grows overtime being more difficult to manage and increasing operations time.

At CERN, we have the policy to preserve deleted instances information for 3 months in the database before removing them.

Nova has the functionality to move the deleted instances to “shadow” tables (nova-manage db archive_deleted_rows [--max_rows <number>]). This functionality can 
remove all deleted entries in the main tables or a maximum number of rows can be specified. However, a row doesn’t mean an instance because some tables have several entries for the same instance. Also, in a cloud that uses cells running the “archive_deleted_rows” with “max_rows” defined will not keep the top and children cells in sync.

In order to remove the database entries of deleted instances in a cell environment with a grace period we developed a small tool that is available at:

We start removing deleted instances from the top database defining a date until the rows should be deleted.
python cern-db-purge --date "2015-02-01 00:00:00" --config nova.conf

Cascading doesn’t work in nova database so the script checks if the instances were deleted before the specified date and removes all the rows associated with them in the different tables. Also, it saves the uuid and some more information about an instance in a file. We decided to not delete the instances from top and children at same time to have more operational control during the interventions. 

This file is used to remove the deleted instances from the children cells. 
python cern-db-purge --file "delete_these_instances.txt" --cell 'top_cell!child_cell_01' --config nova.conf

The script goes through all instances in the file for that specific child cell and after some consistency checks removes all the rows related with the instances.
In this way we make sure that if an instance is removed from the top cell is also deleted from the children cells.

Depending in the database size the tool can take several hours to run.

This tool needs access to nova database and was tested with Icehouse release. The database endpoint should be defined in the configuration file (it can be nova.conf). Since it reads and updates the database, the administrator should be extremely careful when using it.

Monday, 23 March 2015

Not all cores are created equal

Within CERN's compute cloud, the hypervisors vary significantly in performance. We generally run the servers for around 5 years before retirement and there are around 3 different configurations selected each year through public procurement.

Benchmarking in High Energy Physics is done using a benchmark suite called HEPSpec 2006 (HS06). This is based on the C++ programs within the Spec 2006 suite run in parallel according the number of cores in the server. The performance range is around a factor of 3 between the slowest and the fastest machines [1].  

When machines are evaluated after delivery, the HS06 rating for each hardware configuration is saved into a hardware inventory database.

Defining a flavor for each hardware type was not attractive as there are 15 different configurations to consider and users would not easily find out which flavors have free cores. Instead, users ask for the standard flavors, such as m1.small 1 core virtual machine, you could land on a hypervisor giving 6 HS06 or one giving 16. However, the accounting and quotas is done using virtual cores so the 6 and 16 HS06 virtual cores are considered equivalent.

in order to improve our accounting, we therefore wanted to provide the performance of the VM along with the metering records giving the CPU usage through ceilometer. Initially, we thought that this would require some additional code to be added to ceilometer but this is actually possible using the standard ceilometer functions with transformers and publishers.

The following approach was implemented.
  • On the hypervisor, we added an additional meter 'hs06' which provides the CPU rating of the VM normalised by the HS06 performance of the hypervisor. This value is determined using the HS06 value stored in the Hardware Database which can be provided to the hypervisor via a Puppet Fact.
  • This data is stored, in addition to the default 'cpu' record in ceilometer
The benefits of this approach are
  • There is no need for external lookup to the hardware database to process the accounting
  • No additional rights for the accounting process is required (such as to read the mapping between VM and hypervisor
  • Scenarios such as live migration of VMs from one hypervisor to another of different HS06 are correctly handled
  • No modifications to the ceilometer upstream code are required which both improves deployment time and does not invalidate upstream testing
  • Multiple benchmarks can be run concurrently. This allows a smooth migration from HS06 to a following benchmark HS14 by providing both sets of data.
  • Standard ceilometer behaviour is not modified so existing programs such as Heat which use this data can continue to run
  • This assumes no overcommitment of CPU. Further enhancements to the configuration would be possible in this area but this would require further meters.
  • The information is calculated directly on the hypervisor so it is scalable and it is calculated inline which avoids race conditions when the virtual machine is deleted and therefore the mapping VM to HV is no longer available
The assumptions are
  • The accounting is based on the delivered clock ticks to the hypervisor. This will vary in cases where the hypervisor is running a more recent version of the operating system with a later compiler (and thus probably has a higher HS06 rating). Running older OS versions is therefore corresponding less efficient.
  • The cloud is running at least the Juno OpenStack release
To implement this feature, the pipeline capabilities of ceilometer are used. These are configured automatically by the puppet-ceilometer component into /etc/ceilometer/pipeline.yaml.
The changes required are in several blocks. In the sources section as indicated by
A further source needs to be defined to get the CPU metric available for transformation. This polls every 10 minutes (600 seconds) from the CPU meter and sends the data to the sink for the hs06
    - name: hs06_source
      interval: 600
          - "cpu"
          - hs06_sink
The hs06_sink processing is defined later in the file in the sinks section
The entry below takes the number of virtual cores of the VM and scales by 10 (which is the example HS06 CPU performance per core) and 0.98 (for the virtualisation overhead factor). It is reported in units of HS06s (i.e. HepSpec 2006). The value of 10 would be derived from the Puppet HS06 value for the machine divided by the number of cores in the server (from the Puppet fact processorcount). Puppet can be used to configure a hard-coded value per hypervisor that is delivered to the machine as a fact and used to generate the pipeline.yaml configuration file.
    - name: hs06_sink
          - name: "arithmetic"
                    name: "hs06"
                    unit: "HS06"
                    type: "gauge"
                    expr: "$(cpu).resource_metadata.vcpus*10*0.98"
          - notifier://
Once these changes have been done, the ceilometer daemons can be restarted to get the new configuration.
 service openstack-ceilometer-compute restart
If there are errors, these will be reported to /var/log/ceilometer/compute.log. These can be checked with
egrep "(ERROR|WARNING)" /var/log/ceilometer/compute.log
The first messages like "dropping sample with no predecessor" are to be expected as they are handling differences between the previous values and the current ones (such as cpu utilisation).
After 10 minutes or so, ceilometer will poll the CPU, generate the new hs06 value and this can be queried using the ceilometer CLI.
ceilometer meter-list | grep hs06
will include the hs06 meter
| hs06                                | cumulative | HS06        | c6af7651-5fc5-4d37-bf57-c85238ee098c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
| hs06                                | cumulative | HS06        | e607bece-d9df-4792-904a-3c4adca1b99c         | 1cdd42569f894c83863e1b76e165a70c | c4b673a3bb084b828ab344a07fa40f54 |
and the last 5 entries in the database can be retrieved
ceilometer sample-list -m hs06 -l 5
produces the output
| Resource ID                          | Name | Type  | Volume | Unit | Timestamp           |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:19:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:16:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:13:49 |
| 1fa28676-b41c-4673-9d31-1fa83711725a | hs06 | gauge | 12.0   | HS06 | 2015-03-22T09:10:49 |
| b812c69c-3c9f-4146-952e-078a266b11c5 | hs06 | gauge | 11.0   | HS06 | 2015-03-22T08:54:25 |


  1. Ulrich Schwickerath - "VM benchmarking: update on CERN approach"
  2. Ceilometer architecture
  3. Basic introduction to ceilometer using RDO -
  4. Ceilometer configuration guide for transformers
  5. Ceilometer arithmetic guide at

Saturday, 21 March 2015

Nova quota usage - synchronization

Nova quota usage gets frequently out of sync with the real usage consumption.
We are hitting this problem since a couple of releases and it’s increasing with the number of users/tenants in the CERN Cloud Infrastructure.

In nova there are two configuration options (“max_usage” and “until_refresh”) that define when the quota usage should be refreshed. In our case we have configured them with “-1” which means the quota usage must be refreshed every time “_is_quota_refresh_needed” method is called.
For more information about these options you can see a great blog post by Mike Dorman at

This worked well in the releases before Havana. The quota gets out of sync and it’s refreshed next time a tenant user performs an operation (ex: create/delete/…).
However, in Havana with the introduction of “user quotas” ( this problem started to be more frequent even when forcing the quota to refresh every time.

At CERN Cloud Infrastructure a tenant usually has several users. When a user creates/deletes/… an instance and the quota gets out of sync it will affect all users in the tenant. The quota refresh only updates the resources of the user that is performing the operation and not all tenant resources. This means that in a tenant the quota usage will only be fixed if the user owner of the resource out of sync performs an operation.

The source of quota desync is very difficult to reproduce. In fact all our tries have failed to reproduce it consistently.
In order to fix the quota usage the operator needs to manually calculate the quota that is in use and update the database. This process is very cumbersome, time consuming and is can lead to the introduction of even more inconsistencies in the database.

In order to improve our operations we developed a small tool to check which quotas are out of sync and fix them if necessary.
The tool is available in CERN Operations github at:

How to use it?

usage: nova-quota-sync [-h] [--all] [--no_sync] [--auto_sync]
                       [--project_id PROJECT_ID] [--config CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  --all                 show the state of all quota resources
  --no_sync             don't perform any synchronization of the mismatch
  --auto_sync           automatically sync all resources (no interactive)
  --project_id PROJECT_ID
                        searches only project ID

  --config CONFIG       configuration file

The tool calculates the resources in use and compares them with the quota usages.
For example, to see all resources in quota usages that are out of sync:

# nova-quota-sync --no_sync

| Project ID  | User ID  |  Instances   |     Cores      |         Ram          |  Status  |
| 58ed2d48... | user_a   |  657 -> 650  |  2628 -> 2600  |  5382144 -> 5324800  | Mismatch |
| 6f999252... | user_b   |    9 -> 8    |    13 -> 11    |    25088 -> 20992    | Mismatch |
| 79d8d0a2... | user_c   |  232 -> 231  |  5568 -> 5544  |  7424000 -> 7392000  | Mismatch |
| 827441b0... | user_d   |   42 -> 41   |    56 -> 55    |   114688 -> 112640   | Mismatch |
| 8a5858da... | user_e   |    2 -> 4    |     2 -> 4     |     1024 -> 2048     | Mismatch |

The quota usage synchronization can be performed interactively per tenant/project (don’t specify the argument --no_sync) or automatically for all “mismatch” resources with the argument “--auto-sync”.

This tool needs access to nova database. The database endpoint should be defined in the configuration file (it can be nova.conf). Since it reads and updates the database be extremely careful when using it.

Note that quota reservations are not considered in the calculations or updated.

Tuesday, 17 February 2015

Delegation of rights

At CERN, we have 1st and 2nd level support teams to run the computer centre infrastructure. These groups provide 24x7 coverage for problems and initial problem diagnosis to determine which 3rd line support team needs to be called in the event of a critical problem. Typical operations required are
  • Stop/Start/Reboot server
  • Inspect console
When we ran application services on physical servers, these activities could be performed using a number of different technologies
  • KVM switches
  • IPMI for remote maagement
  • Power buttons and the console trolley
With a virtual infrastructure, the applications are now running on virtual machines within a project. These operations are not available by default for the 1st and 2nd level teams since only the members of the project can perform these commands. On the other hand, the project administrator rights contain other operations (such as delete or rebuild servers) which are not needed by these teams.

To address this, we have defined an OpenStack policy for the projects concerned. This is an opt-in process so that the project administrator needs to decide whether these delegated rights should be made available (either at project creation or later).

Define operator role

The first step is to define a new role, operator, for the projects concerned. This can be done through the GUI ( or via the CLI ( In CERN's case, we include it into the workflow in the project creation.

On a default configuration,

$ keystone role-list
|                id                |      name     |
| ef8afe7ea1864b97994451fbe949f8c9 | ResellerAdmin |
| 8fc0ca6ef49a448d930593e65fc528e8 | SwiftOperator |
| 9fe2ff9ee4384b1894a90878d3e92bab |    _member_   |
| 172d0175306249d087f9a31d31ce053a |     admin     |

A new role operator needs to be defined, using the steps from the documentation

 $ keystone role-create --name operator
| Property |              Value               |
|    id    | e97375051a0e4bdeaf703f5a90892996 |
|   name   |             operator             |

and the new role will then appear in the keystone role-list.

Now add a new user operator1

$ keystone user-create --name operator1 --pass operatorpass
| Property |              Value               |
|  email   |                                  |
| enabled  |               True               |
|    id    | f93a50c12c164f329ee15d4d5b0e7999 |
|   name   |            operator1             |
| username |            operator1             |

and add the  operator1 account to the role

$ keystone user-role-add --user operator1 --role operator  --tenant demo
$ keystone user-role-list --tenant demo --user operator1

A similar role is defined for accounting which is used to allow the CERN accounting system read-only access to data about instances so that an accounting report can be produced without needing OpenStack admin rights.

For mapping which users are given this role, we use the Keystone V3 functions available through the OpenStack unified CLI.

$ openstack role add --group operatorGroup --role operator  --tenant demo

Using a group operatorGroup, we are able to define the members in Active Directory and then have those users updated automatically with consistent role sets. The users can also be added explicitly

$ openstack role add --user operator1 --role operator  --tenant demo

Update nova policy

The key file is called policy.json in /etc/nova which defines the roles and what they can do. There are two parts to the rules, firstly a set of groupings which give a human readable description for a complex rule set such as a member is someone who is not an accounting role and not an operator:

    "context_is_admin":  "role:admin",
    "context_is_member": "not role:accounting and not role:operator",
    "admin_or_owner":  "is_admin:True or (project_id:%(project_id)s and rule:context_is_member)",
    "default": "rule:admin_or_owner",
    "default_or_operator": "is_admin:True or (project_id:%(project_id)s and not role:accounting)",

The particular rules are relatively self-descriptive.

The actions can then be defined using these terms

    "compute:get_all": "rule:default_or_operator",
    "compute:get_all_tenants": "rule:default_or_operator",
    "compute:reboot":"rule:default_or_operator", "compute:get_vnc_console":"rule:default_or_operator",
    "compute_extension:console_output": "rule:default_or_operator",
    "compute_extension:consoles": "rule:default_or_operator",

With this, a user group can be defined to allow stop/start/reboot/console while not being able to perform the more destructive operations such as delete.

Thursday, 5 February 2015

Choosing the right image

Over the past 18 months of production at CERN, we have provided a number of standard images for the end users to use when creating virtual machines.
  • Linux
    • Scientific Linux CERN 5
    • Scientific Linux CERN 6
    • CERN CentOS 7
  • Windows
    • Windows 7
    • Windows 8
    • Windows Server 2008
    • Windows Server 2012
To accelerate deployment of new VMs, we also often have
  • Base images which are the minimum subset of packages on which users can build their custom virtual machines using features such as cloud-init or Puppet to install additional packages
  • Extra images which contain common additional packages and accelerate the delivery of a working environment. These profiles are often close to a desktop like PC-on-demand. Installing a full Office 2013 suite on to a new virtual machine can take over one hour so preparing this in advance saves time for the users.
However, with each of these images and additional software, there is a need to maintain the image contents up to date.
  • Known security issues should be resolved within the images rather than relying on the installation of new software after the VM has booted
  • Installation of updates slows do the instantiation of virtual machines
The images themselves are therefore rebuilt on a regular basis and published to the community as public images. The old images, however, should not be deleted as they are needed in the event of a resize or live migration (see Images cannot be replaced in Glance since this would lead to inconsistencies on the hypervisors.

As a result, the number of images in the catalog increases on a regular basis. For the web based end user, this can make navigating the Horizon GUI panel for the images difficult and increase the risk that an out of date image is selected.

 The approach that we have taken is to build on the image properties ( which allow the image maintainer to tag images with attributes. We use the following from the standard list

  • architecture => "x86_64", "i686"
  • os_distro => "slc", "windows"
  • os_distro_major => "6", "5"
  • os_distro_minor => "4"
  • os_edition => "<Base|Extra>" which set of additional packages were installed into the image.
  • release_date => "2014-05-02T13:02:00" for at what date was the image made available to the public user
  • custom_name => "A custom name" allows a text string to override the default name (see below)
  • upstream_provider => URL gives a URL to contact in the event of problems with the image. This is useful where the image is supplied by a 3rd party and the standard support lines should not be used.

We also defined additional fields

With these additional fields, the latest images can be selected and a subset presented for the end user to choose.

The algorithm used is as follows with the sorting sequence

  1. os_distro ASC
  2. os_distro_major DESC
  3. os_distro_minor DESC
  4. architecture DESC
  5. release_date DESC
Images which are from previous releases (i.e. where os_distro, os_distro_major and architecture are the same) are only shown if the 'All' tab is selected.

The code is in preparation to be proposed as an upstream patch. For the moment, it can be found in the CERN github repository (