LHC Tunnel

Sunday, 29 November 2015

Our cloud in Kilo


Following on from previous upgrades, CERN migrated its OpenStack cloud to Kilo between September and November. Along with the bug fixes, we plan to exploit the significant number of new features, especially those related to performance tuning. The overall cloud architecture was covered in a talk at the Tokyo OpenStack summit; the video is at https://www.openstack.org/summit/tokyo-2015/videos/presentation/unveiling-cern-cloud-architecture.

As the LHC continues to run 24x7, these upgrades were performed while the cloud was in production and the running virtual machines were untouched.

Previous upgrades have been described in earlier posts on this blog. The same staged approach was used again: while most of the steps went smoothly, a few problems were encountered.
  • Cinder - we encountered the bug https://bugs.launchpad.net/cinder/+bug/1455726, which led to a foreign key error. The cause appears to be related to UTF8. The patch (https://review.openstack.org/#/c/183814/) was not completed and so was not included in the release. More details are in the thread at http://lists.openstack.org/pipermail/openstack/2015-August/013601.html.
  • Keystone - one of the cache configuration parameters had changed syntax and this was not reflected in the configuration generated by Puppet. The symptom was high load on the Keystone servers, since caching was silently disabled (a sketch of the caching layer involved is shown below).
  • Glance - given the rolling upgrade approach for Glance, we took advantage of having virtualised the majority of the Glance server pool. This allowed new servers to be brought online with a Kilo configuration and the old Juno ones to be deleted.
  • Nova - we upgraded the control plane services along with the QA compute nodes. With versioned objects, we could stage the migration of the thousands of compute nodes rather than update them all at once. Puppet took care of deploying the appropriate RPMs.
    • Following the upgrade, we had an outage of the metadata service for the OpenStack-specific metadata; the EC2 metadata continued to work fine. This is a cells-related issue and we will create a bug/blueprint for the fix.
    • The VM resize operations give errors during execution. We are tracking this with the upstream developers.
    • We wanted to use the latest Nova NUMA features, but encountered a problem when combining them with cells, although they worked well in a non-cells cloud. This is being tracked in https://bugs.launchpad.net/nova/+bug/1517006. We will use the new features for performance optimisation once these problems are resolved (a sketch of how the NUMA features are requested through flavor extra specs is shown below).
    • The dynamic migration of flavors was only partially successful. With the cells databases holding the flavor data in two places, the migration needed to be done in both simultaneously. We resolved this by forcing the migration of the flavors to the new endpoint.
    • The handling of ephemeral drives in Kilo differs from Juno. The option default_ephemeral_format now defaults to vfat rather than ext3. The aim seems to have been to give vfat to Windows and ext4 to Linux, but our environment does not follow this. This was also reported by Nectar, and we could not find any migration advice in the Kilo release notes. We have set the default back to ext3 while we work out the migration implications.
    • We are also working through a scaling problem in our most dynamic cells, tracked at https://bugs.launchpad.net/nova/+bug/1524114. The scheduler queries all VMs, not just the active ones; since we create and delete hundreds of VMs an hour, there is a large volume of soft-deleted VMs, which makes the query take longer than expected (see the database sketch below).
Catching these cases early is part of the scope of the Cells V2 project at https://wiki.openstack.org/wiki/Nova-Cells-v2, to which we are contributing along with the BARC centre in Mumbai, so that the cells configuration becomes the default (with only a single cell) and the upstream test cases are enhanced to validate the multi-cell configuration.
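
To illustrate the Keystone caching problem above, the sketch below shows the dogpile.cache layer that Keystone's caching is built on, assuming a memcached backend on localhost; the server address, expiration time and the decorated function are illustrative only, but they show why a silently disabled cache sends every request through to the backend and drives up the load.

```python
import time

from dogpile.cache import make_region

# Keystone's caching sits on top of dogpile.cache; the [cache] options in
# keystone.conf map onto a configure() call roughly like this (assumed values).
region = make_region().configure(
    'dogpile.cache.memcached',
    expiration_time=600,
    arguments={'url': '127.0.0.1:11211'},
)

@region.cache_on_arguments()
def lookup_role_assignments(user_id, project_id):
    # Stand-in for an expensive database lookup. When caching is silently
    # disabled, every API request falls through to this call.
    time.sleep(0.1)
    return ['Member']

# The first call hits the backend; repeated calls are served from memcached.
print(lookup_role_assignments('alice', 'lhc-project'))
print(lookup_role_assignments('alice', 'lhc-project'))
```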
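
The NUMA features mentioned in the list are driven from flavor extra specs. The sketch below uses python-novaclient with placeholder credentials and flavor sizes to show roughly how a guest NUMA topology can be requested; the exact flavors and extra specs used at CERN are not described here.

```python
from novaclient import client

# Placeholder credentials and endpoint, not CERN's real ones.
nova = client.Client('2', 'admin', 'secret', 'admin',
                     'http://keystone.example.org:5000/v2.0')

# Create a flavor and request that guests be spread over two NUMA nodes.
# hw:numa_nodes is a standard Nova extra spec; name and sizes are examples.
flavor = nova.flavors.create(name='m1.numa', ram=8192, vcpus=4, disk=40)
flavor.set_keys({'hw:numa_nodes': '2'})
```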
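
The scaling problem with soft-deleted VMs can be seen directly in a cell's Nova database: deleted instances remain in the instances table with the deleted column set, so a query that does not filter on it scans far more rows than there are running VMs. A rough sketch with SQLAlchemy, assuming a hypothetical database URL:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection URL for one cell's Nova database.
engine = create_engine('mysql+pymysql://nova:secret@cell-db.example.org/nova')

with engine.connect() as conn:
    # Soft-deleted instances keep their rows with deleted != 0, so an
    # unfiltered query grows with the churn rate, not with the active VMs.
    active = conn.execute(
        text("SELECT COUNT(*) FROM instances WHERE deleted = 0")).scalar()
    total = conn.execute(
        text("SELECT COUNT(*) FROM instances")).scalar()
    print("active instances: %d, rows without the filter: %d" % (active, total))
```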

As some of the hypervisors are still running Scientific Linux 6, we used the approach from GoDaddy of packaging the components using software collections. Details are available at https://github.com/krislindgren/openstack-venv-cent6. We used this for nova and ceilometer, which are the agents installed on the hypervisors. The controllers were upgraded to CentOS 7 as part of the move to Kilo.

Overall, getting to Kilo enables new features and includes bug fixes that reduce administration effort. Keeping up with new releases requires careful planning and sharing in upstream activities such as the Puppet modules, but it has proven to be the best approach. With many of the CERN OpenStack team at the summit in Tokyo, we did not complete the upgrade before Liberty was released, but it was completed soon afterwards.

With the Kilo base in production, we are now ready to start work on the nova-network to Neutron migration, the deployment of the new EC2 API project and enabling Magnum for container-native applications.