LHC Tunnel

Monday 24 February 2014

Our Cloud in Havana


TL;DR: Upgrading a nearly 50,000-core cloud from Grizzly to Havana can be done as a series of steps, each of which may involve a short period of reduced functionality, while VMs remain available throughout.

At CERN, we started our production cloud service on Grizzly in July 2013. The previous OpenStack clouds had been pre-production environments with a fixed lifetime: they were available for use until an end date that would be announced, at which point users would move to the new version by re-creating instances with tools such as Puppet or by snapshotting and uploading instances.

With the Grizzly release, we made the service available with an agreement to upgrade in place rather than build anew. This blog details our experiences moving to the Havana release.

High Level Approach

We took a rolling upgrade approach, component by component, depending on need and complexity.
Following the release notes and operations guide, we chose the following order:
  1. Ceilometer
  2. Glance
  3. Keystone
  4. Cinder
  5. Client CLIs
  6. Horizon
  7. Nova
The CERN cloud is based on the RDO distribution (see http://openstack.redhat.org). The majority of the servers run Scientific Linux 6.5, with hypervisors running KVM. We follow a multi-hypervisor approach, so we also have Windows 2012 R2 with Hyper-V. The cloud databases use MySQL.

Other configurations and sites may encounter issues other than those listed here, so please check that this approach is appropriate for your environment before executing it.

Ceilometer

Although the cloud was in production in July, we had problems getting Ceilometer to work well with cells and with the number of hypervisors we have (over 2,000). We therefore chose to upgrade Ceilometer to Havana early, as this provided much of the functionality we needed and avoided the need to backport.

Havana Ceilometer worked well with Grizzly Nova and allowed us to progress further with detailed metering of our cloud. We still need the patch, which was included in the Havana 2013.2.2 stable release after we upgraded.

Glance

The CERN Glance environment is backed by Ceph. We run multiple Glance servers behind an HAProxy load balancer.

For the upgrade, we:
  • Stopped the Glance service at 16h00
  • Performed the database upgrade steps
  • Installed the Havana packages on all top and cell controllers
  • Re-enabled Glance at 16h45
One issue was spotted where access to non-public images was possible when using the nova image-list command. The images were not visible when using glance image-list. As a mitigation, access to the Nova image API was blocked and users were told to use the glance command (as was already recommended).

The root cause was related to the bug https://bugs.launchpad.net/glance/+bug/1152716. The fix for the problem was to add the policy statement for the parameter context_is_admin which is needed to limit access to images for projects.
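As a rough illustration of this kind of fix, the sketch below adds a context_is_admin rule to a policy document when it is missing. The rule value "role:admin" and the helper name ensure_context_is_admin are assumptions made for this example; check the policy file shipped with your Glance packages for the value appropriate to your deployment.

```python
import json


def ensure_context_is_admin(policy_text, rule="role:admin"):
    """Return policy JSON text that is guaranteed to define context_is_admin.

    The default rule value is an assumption for illustration only.
    """
    policy = json.loads(policy_text)
    # Leave an existing rule untouched; only fill in the missing key.
    policy.setdefault("context_is_admin", rule)
    return json.dumps(policy, indent=4, sort_keys=True)


# Hypothetical policy content without the rule.
before = '{"default": ""}'
after = ensure_context_is_admin(before)
print(after)
```

Using setdefault means a site-specific rule that is already present is never overwritten.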

Keystone

The overall approach taken was to try an online upgrade. We have a Keystone instance in each of our cells (so we can talk to each of them independently of the top level cell API nodes), so these were good candidates to upgrade first. We use Active Directory for the user credentials, so the Keystone service itself holds only a limited amount of state (related to EC2 credentials and token management).

Havana brings significant additional functionality, such as PKI-based tokens (which allow tokens to be validated without calling home) and the V3 API, which adds many features we are interested in, such as Domains. We chose the incremental approach of migrating with the same functionality first and enabling the additional functionality once the code had been deployed.

For the main Keystone instance, we aimed to benefit from the load balancing layer and from the fact that there were minimal database changes in Keystone. The largest problem foreseen was the EC2 credentials since, in Grizzly, these are stored in a different column from the one used in Havana (just Credentials). These columns had to be kept in sync with a script during the phase where both versions were running.
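The sync script itself is not reproduced here. A minimal sketch of the idea, assuming that each Grizzly EC2 credential row carries access, secret, user_id and tenant_id fields and that the Havana credential row stores the key pair as a JSON blob with a type of "ec2", could look like the following. All field names, and the choice of a SHA-256 hash of the access key as the row id, are assumptions for illustration and should be checked against the schema of the release being deployed. The sketch only copies in one direction.

```python
import hashlib
import json


def to_havana_credential(ec2_row):
    """Map one Grizzly-style EC2 credential row to a Havana-style row.

    Field names on both sides are assumptions for this sketch.
    """
    blob = {"access": ec2_row["access"], "secret": ec2_row["secret"]}
    return {
        # Deterministic id so re-running the sync does not create duplicates.
        "id": hashlib.sha256(ec2_row["access"].encode("utf-8")).hexdigest(),
        "user_id": ec2_row["user_id"],
        "project_id": ec2_row["tenant_id"],
        "type": "ec2",
        "blob": json.dumps(blob, sort_keys=True),
    }


def sync(ec2_rows, credentials):
    """Add any EC2 credential missing from the credentials dict (keyed by id)."""
    added = 0
    for row in ec2_rows:
        cred = to_havana_credential(row)
        if cred["id"] not in credentials:
            credentials[cred["id"]] = cred
            added += 1
    return added


# Hypothetical sample data.
grizzly_rows = [
    {"access": "AK1", "secret": "s3cret", "user_id": "u1", "tenant_id": "p1"}
]
havana_credentials = {}
print(sync(grizzly_rows, havana_credentials))  # first run copies the row
print(sync(grizzly_rows, havana_credentials))  # second run finds nothing new
```

Because the id is derived from the access key, the sync can be run repeatedly while both versions are in service.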

In practice, we performed the upgrade more rapidly than originally planned as some of the clients gave errors when they had received a token from a Grizzly keystone and were authenticating against a Havana one. Thus, we upgraded all the keystone servers as soon as the final functional tests in the production environment were completed.

The detailed error message on a Grizzly client was:

2014-01-23 09:10:02    ERROR [root] 'token'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/keystone/common/wsgi.py", line 265, in __call__
    result = method(context, **params)
  File "/usr/lib/python2.6/site-packages/keystone/token/controllers.py", line 541, in validate_token
    self._assert_default_domain(context, token_ref)
  File "/usr/lib/python2.6/site-packages/keystone/token/controllers.py", line 482, in _assert_default_domain
    if (token_ref['token_data']['token']['user']['domain']['id'] !=
KeyError: 'token'

The solution was to complete the Keystone upgrade on all the servers.
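The traceback shows the validation code indexing token_ref['token_data']['token']. A simplified sketch of that failure mode follows, assuming that a token record written by one release wraps its payload under an "access" key rather than the "token" key the other release expects. The sample dictionaries are invented for illustration and are not real Keystone data.

```python
def user_domain_id(token_ref):
    """Mimic the lookup from the traceback: fails if 'token' is absent."""
    return token_ref["token_data"]["token"]["user"]["domain"]["id"]


# Hypothetical record in the layout the validating code expects.
expected_layout = {"token_data": {"token": {"user": {"domain": {"id": "default"}}}}}
# Hypothetical record in a different layout (assumed for this sketch).
other_layout = {"token_data": {"access": {"user": {"id": "u1"}}}}

print(user_domain_id(expected_layout))
try:
    user_domain_id(other_layout)
except KeyError as exc:
    print("KeyError:", exc)
```

Running every server on the same release removes the mixed layouts, which is why completing the upgrade resolved the error.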

During the upgrade, we also noticed database growth from expired tokens, which had not been purged over the past few months. This can now be managed using keystone-manage token_flush (https://blueprints.launchpad.net/keystone/+spec/keystone-manage-token-flush-periodically). This is not specifically an upgrade problem, but it is worth doing to keep the database manageable.
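Conceptually, flushing tokens amounts to deleting rows whose expiry time has passed. The sketch below illustrates this against an in-memory SQLite table. The table and column names (token, expires) are assumptions for the example, and this is not how keystone-manage itself is implemented.

```python
import sqlite3
from datetime import datetime, timedelta


def flush_expired(conn, now):
    """Delete tokens that expired before `now`; return how many were removed."""
    cur = conn.execute("DELETE FROM token WHERE expires < ?", (now.isoformat(),))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (id TEXT PRIMARY KEY, expires TEXT)")
now = datetime(2014, 2, 24, 12, 0, 0)
conn.executemany(
    "INSERT INTO token VALUES (?, ?)",
    [
        ("old", (now - timedelta(days=30)).isoformat()),
        ("fresh", (now + timedelta(hours=1)).isoformat()),
    ],
)
removed = flush_expired(conn, now)
remaining = [row[0] for row in conn.execute("SELECT id FROM token")]
print(removed, remaining)
```

Running such a purge on a regular schedule keeps the table from growing between upgrades.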

Cinder

We had backported some Cinder cells functionality to Grizzly so that we could launch our Ceph volume storage function before we migrated to Havana. CERN currently runs a 3.5 PB Ceph service including OpenStack images, OpenStack Volumes and several other projects working with the object and block interfaces into Ceph. We had to take a careful migration approach for Cinder to protect the additional columns which were not in the standard Grizzly tables.

The impact for the user community was that volume creation/deletion was not possible for around 15 minutes during the upgrade. Existing VMs with attached storage continued to run without problems.

We followed the standard upgrade procedure:
  • Save a list of all volumes before starting, using nova volume-list --all-t
  • Stop the Cinder daemons
  • Perform a full DB backup of Cinder
  • Update all RPMs
  • Run cinder-manage db sync
  • Restart the daemons
  • Run nova volume-list --all-t again and check that the output matches the saved list
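The before and after comparison in this procedure can be automated. The sketch below assumes the listings were saved as table-style text with the volume ID in the first column; the sample output and the helper names are hypothetical, and the parsing would need adjusting to the real CLI output.

```python
def parse_volume_ids(listing):
    """Extract volume IDs from table-style CLI output.

    Assumes rows look like '| <id> | <status> | ...' and that the header
    row has 'ID' in the first column.
    """
    ids = set()
    for line in listing.splitlines():
        if not line.startswith("|"):
            continue  # skip '+----+' border lines and blank lines
        first = line.strip("|").split("|")[0].strip()
        if first and first != "ID":
            ids.add(first)
    return ids


def compare(before, after):
    """Return (missing, unexpected) volume IDs between two listings."""
    old, new = parse_volume_ids(before), parse_volume_ids(after)
    return sorted(old - new), sorted(new - old)


# Hypothetical listings saved before and after the upgrade.
before = """+------+-----------+
| ID   | Status    |
+------+-----------+
| vol1 | in-use    |
| vol2 | available |
+------+-----------+"""
after = before.replace("| vol2 | available |\n", "")
missing, unexpected = compare(before, after)
print("missing:", missing, "unexpected:", unexpected)
```

Comparing sets of IDs rather than raw text avoids false alarms from changes in column widths or row order.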
One minor issue we found was on the client side. Since the nova client depends on the cinder client, we had to upgrade the nova client to get to the latest cinder client version. With the nova client package being backwards compatible, this was not an issue but forced an earlier than planned update of nova client.


Client CLIs

We use the RDO client packages on our Linux machines. The upgrade was a change of repository and yum update.

The Windows and Mac clients were upgraded in a similar way using pip.

Horizon

We have customised Horizon slightly to add some basic features (adding help and subscribe buttons to the login page, and disabling some features which aren't currently supported on the CERN cloud, such as security groups). These changes were re-applied to the Havana version.

The migration approach was to set up the Horizon web server on a new machine and test out the functionality. Horizon is very tolerant of different versions of code, including partial upgrades such as the CERN environment, where Nova was still at the previous release. We found a minor bug with the date picker when using Firefox, which we'll be reporting upstream.

Once validated, the alias and virtual host definition in Apache were changed to point to the new machine.

Once under production load, however, we started to see stability issues with the Apache server and large numbers of connections. The users saw authentication problems on login and HTTP 500 errors. In the short term, we scheduled a regular web server restart to keep the service usable while we analysed the problem. With memcached looking after the session information, this was a workaround we could use without user disruption, but it is not a long term solution (according to the comments on this post, this is a reported bug at https://bugs.launchpad.net/python-novaclient/+bug/1247056).

Nova

The Nova migration testing took the longest time for several reasons.

It is our most complex configuration, with many cells in two data centres, in Geneva and Budapest. With nearly 50,000 cores and thousands of hypervisors, there are a lot of machines to update.


We have customised Nova to support the CERN legacy network management system. This is a CERN-specific database containing the hostnames, MAC addresses and IPs, which is kept up to date by the nova-network component.

To test, we performed an offline database copy and stepped through the migration scripts in a cloned environment. The only problems encountered were due to the local tables we had added. We then performed functional and stress tests of the environment. With the dynamic nature of the cloud at CERN, we wanted to be sure that there would be no regressions in areas such as performance and stability.
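The idea of stepping through migrations on a copy can be sketched as follows, using SQLite purely for illustration (the production databases use MySQL). The migration statements, table names and the site_network local table are all hypothetical stand-ins; real migrations come from the project's own migration scripts.

```python
import sqlite3

# Hypothetical migrations standing in for the real migration scripts.
MIGRATIONS = [
    (1, "ALTER TABLE instances ADD COLUMN cell_name TEXT"),
    (2, "CREATE INDEX instances_cell_idx ON instances (cell_name)"),
]


def clone(source):
    """Copy a database into a fresh in-memory connection for offline testing."""
    target = sqlite3.connect(":memory:")
    source.backup(target)
    return target


def step_through(conn, migrations, local_table):
    """Apply migrations one at a time, checking the local table after each."""
    query = "SELECT COUNT(*) FROM %s" % local_table
    expected = conn.execute(query).fetchone()[0]
    applied = []
    for version, statement in migrations:
        conn.execute(statement)
        if conn.execute(query).fetchone()[0] != expected:
            raise RuntimeError("local table changed at version %d" % version)
        applied.append(version)
    return applied


# Stand-in for the production database, including one locally added table.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, hostname TEXT)")
production.execute("CREATE TABLE site_network (hostname TEXT, mac TEXT)")
production.execute("INSERT INTO site_network VALUES ('vm1', 'aa:bb:cc:dd:ee:ff')")
production.commit()

copy = clone(production)
print(step_through(copy, MIGRATIONS, "site_network"))
```

Applying one version at a time makes it clear which migration conflicts with a local table, and the production database is never touched.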

During the testing, we found some minor problems:
  • euca2ools did not work on our standard SL 6 configuration (version 2.1.4). The error "instance-type should be of type string" was raised on VM creation. The root cause was an older boto version (see bugzilla ticket).
  • When creating multiple machines using --num-instances on the nova command line with the cells configuration, the unique name created was invalid. A launchpad report was raised, but this was a non-blocking issue as the feature is not used often.
We took a very conservative approach, with clear steps and validation after each step. We had investigated doing the migration cell by cell, but preferred the more cautious route for our first upgrade. For the actual migration, the steps were as follows:
  • Disable Puppet automatic software updating so we would be in control of which upgrades occurred when.
  • Move the Havana repository from the test/QA environment to the master branch in Puppet
  • Run "yum clean all" on all the nodes with mcollective
  • Stop Puppet running on all nodes
  • Starting with the top API controllers, then the cell controllers and finally the compute nodes:
    • Block requests to stop new operations arriving
    • Disable automatic restart of daemons by the monitoring system
    • Stop the daemons and disable RabbitMQ
  • Back up all DBs and create a local dump of each in case we need a rapid restore
  • Update all the Nova RPMs on top controllers
  • Update the top DB
  • Reboot to get the latest Linux kernel
  • Update all the Nova RPMs on the cell controller
  • Update the cell DB
  • Reboot for new kernel
  • Update the RPMs on the compute nodes
  • Enable RabbitMQ
  • Enable services on top controllers
  • Enable services in child cell controllers
  • Enable nova-compute
  • Check out the service communications
  • Update the compute nodes using mcollective
  • Enable monitoring/exceptions on all nodes
  • Enable Puppet on all nodes
  • Perform check out tests
  • Enable user facing APIs at the top cells
The total time to run these steps was around 6 hours, with the VMs running throughout. The database backup, cloning and final migration script testing on the production data took a couple of hours. The software upgrade steps were also significant: checking that RPMs are deployed across thousands of hypervisors is a lengthy process. Although there were a lot of steps, each one can be performed and checked out before continuing.
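The pattern of performing a step and checking it before moving on can be sketched as a small runner. The steps and the state dictionary below are hypothetical stand-ins for real operations such as stopping daemons or updating RPMs; this is an illustration of the approach, not the tooling we used.

```python
def run_steps(steps):
    """Run (name, action, check) steps in order; stop at the first failed check.

    Returns the list of step names that completed and passed their check.
    """
    done = []
    for name, action, check in steps:
        action()
        if not check():
            print("check failed after step:", name)
            break
        done.append(name)
    return done


# Hypothetical state standing in for the real environment.
state = {"daemons": "running", "db_version": "grizzly"}


def stop_daemons():
    state["daemons"] = "stopped"


def upgrade_db():
    state["db_version"] = "havana"


def start_daemons():
    state["daemons"] = "running"


steps = [
    ("stop daemons", stop_daemons, lambda: state["daemons"] == "stopped"),
    ("upgrade database", upgrade_db, lambda: state["db_version"] == "havana"),
    ("start daemons", start_daemons, lambda: state["daemons"] == "running"),
]
completed = run_steps(steps)
print(completed)
```

Stopping at the first failed check leaves the environment in a known state, so the operator can investigate before any later step runs.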

Post Production Issues

  • The temporary table containing quota cache data was cleaned as part of the upgrade but does not appear to be fully recreated. https://bugs.launchpad.net/nova/+bug/1245746 seems to describe the problem. Logging in with the dashboard fixes the problem in most cases.

Conclusions

Upgrading OpenStack in production needs careful planning and testing, but it comes down to a set of standard upgrade steps. There are lots of interesting features in Havana to explore for both the operations team and the end users of the CERN cloud.

Credits

This information was collected from the CERN cloud team that performed the upgrade (Belmiro, Jose, Luis, Marcos, Thomas, Stefano) while also providing user support and adding new hardware.