LHC Tunnel

Thursday 21 May 2015

Juno, EL6 and RDO Community.

CERN selected OpenStack as its cloud platform in 2011. It was natural to choose RDO as our RPM provider; RDO is a community of people using and deploying OpenStack on Red Hat Enterprise Linux, Fedora and distributions derived from these (such as Scientific Linux CERN 6, which powers our hypervisors).

The community decided not to provide an official upgrade path from Icehouse to Juno on EL6 systems.

While our internal infrastructure is moving to CentOS 7, we have to maintain around 2,500 compute nodes on SLC6 during the transition.

As mentioned in the previous blog post, we recently finished the migration from Icehouse to Juno. Part of this effort was to rebuild the Juno RDO packages for RHEL6 derivatives and to provide a tested upgrade path from Icehouse.

We are happy to announce that, with the help of the CentOS infrastructure, we have publicly rebuilt the openstack-nova and openstack-ceilometer packages and made them available to the community.

The effort is led by the CentOS Cloud SIG, and I'd like to thank in particular Alan Pevec, Haïkel Guemar and Karanbir Singh for their support and time.

For all the information on how to use the Juno EL6 packages, please follow this link: https://wiki.centos.org/Cloud/OpenStack/JunoEL6QuickStart.
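As a rough sketch of what installation looks like on a compute node (the sub-package names below are typical RDO compute-node packages, shown for illustration; the repository setup itself is described in the quick-start guide linked above and is not reproduced here):

# after enabling the CentOS Cloud SIG Juno EL6 repository per the quick-start guide:
yum install openstack-nova-compute openstack-ceilometer-compute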

Tuesday 12 May 2015

Our cloud in Juno

This blog continues our series on OpenStack upgrades; previous upgrades are documented in earlier posts in this series. At CERN, we upgrade the cloud incrementally, component by component, giving details of the problems we encounter along the way.

For Juno, we followed the same pattern as previously:
  • cinder
  • glance
  • keystone
  • ceilometer
  • nova
  • horizon
As we are now rolling out our CentOS 7 based controllers, we took the opportunity to do that upgrade as well. Many of the controllers are themselves virtualised, which allows us to scale out as needed. An HAProxy configuration allows us to switch rapidly between the services at the different levels.
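To illustrate the idea (a minimal sketch, not our production configuration; the names, addresses and ports are invented for the example):

listen keystone_api
    bind *:5000
    balance roundrobin
    # moving traffic between the old and new controllers is a matter of
    # enabling or disabling server lines like these
    server controller-slc6-1 192.168.10.11:5000 check
    server controller-c7-1   192.168.10.21:5000 check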

The motivation to move to CentOS 7 comes from two primary sources:
  • CERN is moving from its own distribution, Scientific Linux CERN, to CentOS 7.
  • The RDO packages are now being produced on CentOS 7. This means that we can benefit from community testing if we are also on that version.
We'll give more details on the SLC6 environment in a future posting.

We encountered one problem during the upgrade: the LDAP backend for roles compared role names in upper case, while we were using lower-case roles in LDAP. This was resolved with a quick workaround, and a bug will be reported against https://github.com/openstack/keystone/blob/stable/juno/keystone/assignment/backends/ldap.py#L93.
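Conceptually, the mismatch looked like this (a simplified Python illustration, not the actual keystone code):

# the backend compared an upper-cased role name against the role
# names stored in LDAP, which we keep in lower case
stored_role = 'admin'                     # as stored in our LDAP
expected = 'admin'.upper()                # what the backend compared against

print(stored_role == expected)            # False -> the role is not matched
print(stored_role.upper() == expected)    # True -> a case-insensitive comparison works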

Other than that, the upgrade proceeded smoothly and we're now looking forward to deploying Heat and starting the planning for the migration to Kilo.

Tuesday 5 May 2015

Purging Nova databases in a cell environment

The CERN cloud infrastructure has been running in production since July 2013, and in almost two years more than 1,000,000 VMs have been created in it.

During the last few months, we have had an average of 11,000 VMs running, with a creation/deletion rate of between 100 and 200 VMs per hour.


[Figure: Number of VMs created (cumulative)]

[Figure: Difference between VMs created and deleted per month (cumulative)]

The information about all these instances is stored in the database and remains there when the instances are deleted, because nova uses “soft” delete (the records are marked as deleted rather than removed). As a consequence, the database size grows over time, making it more difficult to manage and increasing the time operations take.
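The effect is easy to see in the database itself (an illustrative check, assuming a MySQL-backed nova database; the host and credentials are placeholders):

import MySQLdb

conn = MySQLdb.connect(host='novadb', user='nova', passwd='secret', db='nova')
cur = conn.cursor()
# soft-deleted rows stay in the instances table, with a non-zero
# "deleted" column and a "deleted_at" timestamp
cur.execute("SELECT COUNT(*) FROM instances WHERE deleted != 0")
print("soft-deleted instance rows: %d" % cur.fetchone()[0])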

At CERN, our policy is to preserve the information about deleted instances in the database for 3 months before removing it.

Nova has the functionality to move deleted instances to “shadow” tables (nova-manage db archive_deleted_rows [--max_rows <number>]). This can remove all deleted entries from the main tables, or a maximum number of rows can be specified. However, a row does not correspond to an instance, since some tables have several entries for the same instance. Also, in a cloud that uses cells, running “archive_deleted_rows” with “max_rows” defined will not keep the top and children cells in sync.
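To see why rows and instances differ, note that a single instance leaves entries in several tables keyed by its UUID (an illustrative sketch with placeholder credentials and a hypothetical UUID; the table list is not exhaustive):

import MySQLdb

conn = MySQLdb.connect(host='novadb', user='nova', passwd='secret', db='nova')
cur = conn.cursor()
uuid = '11111111-2222-3333-4444-555555555555'  # hypothetical instance UUID
for table in ('instance_info_caches', 'instance_system_metadata',
              'instance_metadata', 'block_device_mapping', 'instance_faults'):
    cur.execute("SELECT COUNT(*) FROM %s WHERE instance_uuid = %%s" % table,
                (uuid,))
    print("%s: %d rows" % (table, cur.fetchone()[0]))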

In order to remove the database entries of deleted instances in a cell environment with a grace period, we developed a small tool that is available at:

We start by removing deleted instances from the top database, defining the date up to which deleted rows should be purged:
python cern-db-purge --date "2015-02-01 00:00:00" --config nova.conf

Cascading doesn’t work in the nova database, so the script checks whether instances were deleted before the specified date and removes all the rows associated with them in the different tables. It also saves the UUID and some additional information about each instance to a file. We decided not to delete the instances from the top and children cells at the same time, to keep more operational control during the interventions.
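Conceptually, this first pass does something like the following (a much simplified sketch of the approach, not the actual cern-db-purge code; the credentials are placeholders and only a few of the related tables are shown):

import MySQLdb

conn = MySQLdb.connect(host='novadb-top', user='nova', passwd='secret', db='nova')
cur = conn.cursor()

# 1. find the instances that were soft-deleted before the cut-off date
cur.execute("SELECT uuid FROM instances WHERE deleted != 0 AND deleted_at < %s",
            ('2015-02-01 00:00:00',))
uuids = [row[0] for row in cur.fetchall()]

# 2. record them so the children cells can be purged consistently later
with open('delete_these_instances.txt', 'w') as f:
    f.writelines(u + '\n' for u in uuids)

# 3. no cascading: delete the related rows table by table, then the instances
params = [(u,) for u in uuids]
for table in ('instance_info_caches', 'instance_system_metadata',
              'instance_metadata', 'block_device_mapping'):
    cur.executemany("DELETE FROM %s WHERE instance_uuid = %%s" % table, params)
cur.executemany("DELETE FROM instances WHERE uuid = %s", params)
conn.commit()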

This file is then used to remove the deleted instances from the children cells:
python cern-db-purge --file "delete_these_instances.txt" --cell 'top_cell!child_cell_01' --config nova.conf

The script goes through all instances in the file for that specific child cell and, after some consistency checks, removes all rows related to those instances. In this way we make sure that an instance removed from the top cell is also deleted from the children cells.

Depending on the database size, the tool can take several hours to run.

This tool needs access to the nova database and was tested with the Icehouse release. The database endpoint should be defined in the configuration file (it can be nova.conf). Since the tool reads and updates the database, administrators should be extremely careful when using it.