Tuesday, 2 May 2017

Migrating to Keystone Fernet Tokens

For several OpenStack releases, the Identity service has offered an alternative token format to UUID, called Fernet. This format has a series of advantages over UUID; the most prominent for us is that it does not need to be persisted. We were also interested in speeding up token creation and validation.

At CERN, we have been using the UUID token format since the beginning of the cloud deployment in 2013. The keystone database normally holds around 300,000 tokens, each with an expiration of 1 day. To keep the database size under control, we run the token_flush procedure every hour.
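The hourly flush is just a cron entry; a sketch, assuming a standard packaged install (the path, file name and user are illustrative):

```
# /etc/cron.d/keystone-token-flush (illustrative)
0 * * * * keystone /usr/bin/keystone-manage token_flush
```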

The remaining bugs were sorted out in the Mitaka release, and since Ocata, Fernet is the default token format. We are currently running keystone on Mitaka and decided to migrate to Fernet before upgrading to Ocata.

Before reviewing the upgrade from UUID to Fernet, let's have a brief look at the architecture of the identity service. The service name resolves to a set of load balancers, which redirect requests to a set of backends running keystone under Apache. This allows us to replace or add backends transparently.

The very first question is how many keys we would need. If we take the formula from [1]:

fernet_keys = validity / rotation_time + 2

With a validity of 24 hours and a rotation every 6 hours, we would need 24/6 + 2 = 6 keys.
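As a sanity check, the formula can be expressed directly (a minimal sketch; the function name is ours, not part of keystone):

```python
# Number of Fernet keys needed so that tokens issued under any active key
# remain valid across rotations (integer hours assumed).
def fernet_key_count(token_validity_hours, rotation_hours):
    return token_validity_hours // rotation_hours + 2

print(fernet_key_count(24, 6))  # -> 6, matching the calculation above
```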

As we have several backends, the first task is to distribute the keys among them. For that we use Puppet, which provisions the secrets into the /etc/keystone/fernet-keys folder. This ensures that a newly introduced backend always has the latest set of keys available.

The second task is rotating them. In our case, a cronjob in our Rundeck installation retires the oldest secret and introduces a new one, doing exactly the same as the keystone-manage fernet_rotate command. One important aspect is that on each rotation you need to reload or restart the Apache daemon so that it picks up the new keys from disk.
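To illustrate the rotation scheme, here is a simplified model (not the real keystone implementation): the key file named 0 is the staged key, the highest index is the primary key used to issue new tokens, and on each rotation the staged key is promoted while the oldest active keys are pruned.

```python
import base64
import os
import tempfile

def rotate(keydir, max_keys=6):
    """Promote the staged key (file '0') to primary, stage a fresh key,
    and prune the oldest active keys so at most max_keys remain."""
    indices = sorted(int(name) for name in os.listdir(keydir))
    # The staged key becomes the new primary (highest index).
    os.rename(os.path.join(keydir, '0'),
              os.path.join(keydir, str(indices[-1] + 1)))
    # A fresh staged key takes its place.
    with open(os.path.join(keydir, '0'), 'wb') as f:
        f.write(base64.urlsafe_b64encode(os.urandom(32)))
    # Remove the oldest active (non-zero) keys beyond max_keys.
    indices = sorted(int(name) for name in os.listdir(keydir))
    excess = max(0, len(indices) - max_keys)
    for idx in indices[1:1 + excess]:
        os.remove(os.path.join(keydir, str(idx)))

keydir = tempfile.mkdtemp()
# Start with a staged key (0) and a primary key (1).
for name in ('0', '1'):
    with open(os.path.join(keydir, name), 'wb') as f:
        f.write(base64.urlsafe_b64encode(os.urandom(32)))

for _ in range(10):  # ten rotations
    rotate(keydir)
print(sorted(os.listdir(keydir), key=int))  # at most 6 keys, '0' always present
```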

We prepared all these changes in the production setup quite some time ago and tested that the keys were correctly rotated and distributed. On April 5th, we decided to go ahead. This is the picture of the API messages during the intervention.

There are two distinct steps in the API message usage. The first is a peak of invalid tokens, triggered by end users trying to validate UUID tokens after the change. The second peak is related to OpenStack services that use trusts, like Magnum and Heat. From past experience, these services can be affected by a massive invalidation of tokens: the trust credentials are cached, so both services need a restart to obtain their Fernet tokens.

Below is a picture of the size of the token table in the keystone database. Now that we are on Fernet, it slowly drains to zero thanks to the hourly token_flush job mentioned earlier.

The last picture shows the response time of the identity service during the change. As you can see, the response time is better than with UUID, as stated in [2].

Further response-time improvements are on the way in the Ocata release, and we are working to update the identity service in the near future.


  1. Fernet token FAQ at https://docs.openstack.org/admin-guide/identity-fernet-token-faq.html
  2. Fernet token performance at http://blog.dolphm.com/openstack-keystone-fernet-tokens/
  3. Payloads for Fernet token format at http://blog.dolphm.com/inside-openstack-keystone-fernet-token-payloads/
  4. Deep dive into Fernet at https://developer.ibm.com/opentech/2015/11/11/deep-dive-keystone-fernet-tokens/

Thursday, 9 February 2017

os_type property for Windows images on KVM

OpenStack images have a long list of properties which can be set to describe the image metadata. The full list is described in the documentation. This blog reviews some of these settings for Windows guests running on KVM, in particular Windows 7 and Windows Server 2008 R2.

At CERN, we've used a number of these properties to help users filter images, such as the OS distribution and version, but we have also added some additional properties for specific purposes, such as

  • when the image was released (so the images can be sorted by date)
  • whether the image is the latest recommended one (such as setting the CentOS 7.2 image to not recommended when CentOS 7.3 comes out)
  • which CERN support team provided the image 

For a typical Windows image, we have

$ glance image-show 9e194003-4608-4fe3-b073-00bd2a774a57
| Property          | Value                                                          |
| architecture      | x86_64                                                         |
| checksum          | 27f9cf3e1c7342671a7a0978f5ff288d                               |
| container_format  | bare                                                           |
| created_at        | 2017-01-27T16:08:46Z                                           |
| direct_url        | rbd://b4f463a0-c671-43a8-bd36-e40ab8d233d2/images/9e194003-4   |
| disk_format       | raw                                                            |
| hypervisor_type   | qemu                                                           |
| id                | 9e194003-4608-4fe3-b073-00bd2a774a57                           |
| min_disk          | 40                                                             |
| min_ram           | 0                                                              |
| name              | Windows 10 - With Apps [2017-01-27]                            |
| os                | WINDOWS                                                        |
| os_distro         | Windows                                                        |
| os_distro_major   | w10entx64                                                      |
| os_edition        | DESKTOP                                                        |
| os_version        | UNKNOWN                                                        |
| owner             | 7380e730-d36c-44dc-aa87-a2522ac5345d                           |
| protected         | False                                                          |
| recommended       | true                                                           |
| release_date      | 2017-01-27                                                     |
| size              | 37580963840                                                    |
| status            | active                                                         |
| tags              | []                                                             |
| updated_at        | 2017-01-30T13:56:48Z                                           |
| upstream_provider | https://cern.service-now.com/service-portal/function.do?name   |
| virtual_size      | None                                                           |
| visibility        | public                                                         |

Recently, we have seen some cases of Windows guests becoming unavailable with the BSOD error "CLOCK_WATCHDOG_TIMEOUT (101)".  On further investigation, these tended to occur around times of heavy load on the hypervisors such as another guest doing CPU intensive work.

Windows 7 and Windows Server 2008 R2 were the guest OSes where these problems were observed. Later OS levels did not seem to show the same problem.

We followed the standard processes to make sure the drivers were all updated but the problem still occurred.

Looking into the root cause, the Red Hat support articles were a significant help.

"In the environment described above, it is possible that 'CLOCK_WATCHDOG_TIMEOUT (101)' BSOD errors could be due to high load within the guest itself. With virtual guests, tasks may take more time that expected on a physical host. If Windows guests are aware that they are running on top of a Microsoft Hyper-V host, additional measures are taken to ensure that the guest takes this into account, reducing the likelihood of the guest producing a BSOD due to time-outs being triggered."

These articles suggested using the os_type property to inform the hypervisor that it should enable some additional flags. However, the OpenStack documentation described this as a XenAPI-only setting (which would therefore not apply to KVM hypervisors).

It is not always clear which parameters to set for an OpenStack image. os_distro takes a value such as 'windows' or 'ubuntu', so the flavour of the OS could in principle be derived from it, but os_type is the property the code actually uses.

Thus, in order to get the best behaviour for Windows guests, from our experience, we would recommend setting both the os_distro and os_type as follows.
  • os_distro = 'windows'
  • os_type = 'windows'
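As a rough sketch of why os_type matters (a simplified model of the Nova libvirt driver's decision, not Nova's actual API; the function name is ours): the Hyper-V enlightenments are only emitted when os_type is 'windows', regardless of what os_distro says.

```python
# Simplified model of the check made when generating the guest definition:
# only os_type == 'windows' turns on the Hyper-V enlightenment features.
def hyperv_enabled(image_properties):
    return image_properties.get('os_type') == 'windows'

print(hyperv_enabled({'os_distro': 'windows'}))                        # False
print(hyperv_enabled({'os_distro': 'windows', 'os_type': 'windows'}))  # True
```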
When the os_type parameter is set, some additional XML is added to the KVM configuration following the Kilo enhancement.

  <features>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>
  </features>
  <clock offset='localtime'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>

These changes have led to an improvement when running on loaded hypervisors, especially for Windows 7 and Windows Server 2008 R2 guests. A bug has been opened against the documentation to explain that the setting is not Xen-only.


  • Jose Castro Leon performed all of the analysis and testing of the various solutions.


Monday, 9 January 2017

Containers on the CERN cloud

We have recently made the Container-Engine-as-a-Service (Magnum) available in production at CERN as part of the CERN IT department services for the LHC experiments and other CERN communities. This gives the OpenStack cloud users Kubernetes, Mesos and Docker Swarm on demand within the accounting, quota and project permissions structures already implemented for virtual machines.

We shared the latest news on the service with the CERN technical staff (link). This is the follow up on the tests presented at the OpenStack Barcelona (link) and covered in the blog from IBM. The work has been helped by collaborations with Rackspace in the framework of the CERN openlab and the European Union Horizon 2020 Indigo Datacloud project.


At the Barcelona summit, we presented, together with Rackspace and IBM, the additional performance tests we ran after the previous blog post. We scaled beyond the 2M requests/s to reach around 7M, where some network infrastructure issues unrelated to OpenStack limited further scaling.

As we created the clusters, the deployment time increased only slightly with the number of nodes, as most of the work is done in parallel. For clusters of 128 nodes or larger, however, the deployment time started to increase almost linearly. At the Barcelona summit, the Heat and Magnum teams worked together on proposals for further improvements in future releases, although a 1000-node cluster in 23 minutes is still a good result.

[Chart: deployment time (min) versus cluster size (nodes)]


With the LHC producing nearly 50PB this year, High Energy Physics has some custom storage technologies for specific purposes: EOS for physics data, and CVMFS for read-only, highly replicated content such as applications.

One of the features of providing a private cloud service to the CERN users is combining the functionality of open source community software such as OpenStack with the specific needs of high energy physics. For these to work, some careful driver work is needed to provide appropriate access while enforcing user rights. In particular,
  • EOS provides a disk-based storage system providing high-capacity and low-latency access for users at CERN. Typical use cases are where scientists are analysing data from the experiments.
  • CVMFS provides a scalable, reliable and low-maintenance file system for read-only data such as software.
There are also other storage solutions we use at CERN such as
  • HDFS for long-term archiving of data with Hadoop, accessed via an HDFS driver within the container. HDFS works in user space, so no particular integration was required to use it from inside (unprivileged) containers.
  • Cinder provides additional disk space using volumes if the basic flavor does not have sufficient. This Cinder integration is offered by upstream Magnum, and work was done in the last OpenStack cycle to improve security by adding support for Keystone trusts.
CVMFS was more straightforward as there is no need to authenticate the user. The data is read-only and can be exposed to any container. The access to the file system is provided using a driver (link) which has been adapted to run inside a container. This saves having to run additional software inside the VM hosting the container.

EOS requires authentication through mechanisms such as Kerberos to identify the user and thus determine what files they have access to. Here a container is run per user so that there is no risk of credential sharing. The details are in the driver (link).

Service Model

One interesting question that came up during the discussions of the container service was how to deliver the service to the end users. There are several scenarios:
  1. The end user launches a container engine with their specifications but they rely on the IT department to maintain the engine availability. This implies that the VMs running the container engine are not accessible to the end user.
  2. The end user launches the engine within a project that they administer. While the IT department maintains the templates and basic functions such as the Fedora Atomic images, the end user is in control of the upgrades and availability.
  3. A variation of option 2, where the nodes running containers are reachable and managed by the end user, but the container engine master nodes are managed by the IT department. This is similar to the current offer from the Google Container Engine and requires some coordination and policies regarding upgrades.
Currently, the default Magnum model is option 2, and adding option 3 is something we could do in the near future. As users become more interested in consuming containers, we may investigate option 1 further.


Many applications in use at CERN are being reworked towards a microservices-based architecture. A choice of different container engines is attractive for the software developer. One example of this is the file transfer service, which ensures that the network to other high energy physics sites is kept busy but not overloaded with data transfers. The work to containerise this application was described in the recent CHEP 2016 FTS poster.

While deploying containers is an area of great interest for the software community, the key value comes from the physics applications exploiting containers to deliver a new way of working. The Swan project provides a tool for running ROOT, the High Energy Physics application framework, in a browser with easy access to the storage outlined above. A set of examples can be found at https://swan.web.cern.ch/notebook-galleries. With the academic paper, the programs used and the data available from the notebook, this allows easy sharing with other physicists during the review process using CERNBox, CERN's owncloud based file sharing solution.

Another application being studied is http://opendata.cern.ch/?ln=en which allows the general public to run analyses on LHC open data. Typical applications are Citizen Science and outreach for schools.

Ongoing Work

There are a few major items where we are working with the upstream community:
  • Cluster upgrades will allow us to upgrade the container software. Examples of this would be a new version of Fedora Atomic, Docker or the container engine. With a load balancer, this can be performed without downtime (spec)
  • Heterogeneous cluster support will allow nodes to have different flavors (cpu vs gpu, different i/o patterns, different AZs for improved failure scenarios). This is done by splitting the cluster nodes into node groups (blueprint)
  • Cluster monitoring to deploy Prometheus and cAdvisor with Grafana dashboards for easy monitoring of a Magnum cluster (blueprint).