LHC Tunnel

Friday 9 June 2017

Experiences with Cinder in Production

The CERN OpenStack cloud service has been providing block storage via Cinder since the Havana days in early 2014. Users can choose from seven different volume types, which offer different physical locations, different power feeds, and different performance characteristics. All volumes are backed by Ceph, deployed in three separate clusters across two data centres.

Due to its flexibility, the volume concept has become very popular with users, and the service has grown over the past years to over 1PB of allocated quota, hosted in more than 4'000 volumes. In this post, we'd like to share some of the features we use and point out some of the pitfalls we've run into when running (a very stable and easy to maintain) Cinder service in production.

Avoiding sadness: Understanding the 'host' parameter


With the intent of increasing resiliency, we configured the service from the start to run on multiple hosts. The three controller nodes were set up identically, so all of them ran the API ('c-api'), scheduler ('c-sched') and volume ('c-vol') services.

With the first upgrades, however, we realised that there is a coupling between a volume and the 'c-vol' service that created it: each volume is associated with its creation host which, by default, is identified by the hostname of the controller. So, when the first controller needed to be replaced, the 'c-sched' was no longer able to find the original 'c-vol' service that could execute volume operations. At the time, we fixed this by changing the corresponding volume entries in the Cinder database to point to the newly added host.
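
For illustration, such a re-pointing can also be done with the cinder-manage utility rather than by editing the database directly; a sketch (hostnames are placeholders, and the exact sub-command and flags may vary between releases):

# cinder-manage volume update_host --currenthost <old_controller>@<backend> --newhost <new_controller>@<backend>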

As the Cinder configuration allows the 'host' to be set directly in 'cinder.conf', we set this parameter to the same value on all controllers, with the idea of removing the coupling between a volume and the specific 'c-vol' service that created it. We ran like this for quite a while, and although we never saw issues directly related to this setting, in hindsight it may explain some of the problems we had with volumes getting stuck in transitional states. The main issue is the clean-up done as the daemons start up: since each daemon assumes exclusive access to 'its' volumes, volumes in transient states are "cleaned up", e.g. their state reset, when a daemon starts. In a setup where all daemons share the same 'host', this can cause undesired interference.
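
For reference, the shared value is a single line in the '[DEFAULT]' section of 'cinder.conf' on each controller (the value shown here is purely illustrative):

[DEFAULT]
host = cinder-volumes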

Taking this into account, our setup has been changed to keep 'c-api' on all three controllers, but to run the 'c-vol' and 'c-sched' services on one host only. Closely following the recent work of the Cinder team to improve the locking and allow for Active/Active HA, we're looking forward to having Active/Active 'c-vol' services fully available again.

Using multiple volume types: QoS and quota classes


The scarce resource on our Ceph backend is not space, but IOPS, and after we handed out the first volumes to users, we quickly realised that some resource management was needed. We achieved this by creating a QoS spec and associating it with the one volume type we had at the time:

# cinder qos-create std-iops write_iops_sec=100 read_iops_sec=100
# cinder qos-associate <std_iops_qos_id> <std_volume_type_id>

This not only allows us to limit the IOPS consumed on this volume type, but also to define different service levels. For instance, for more demanding use cases we added a high-IOPS volume type to which access is granted on a per-request basis:

# cinder type-create high-iops
# cinder qos-create high-iops write_iops_sec=500 read_iops_sec=500
# cinder qos-associate <high_iops_qos_id> <high_iops_volume_type_id>

Note that both types are provided by the same backend and physical hardware (which also allows for conversion between these types without data movement using 'cinder retype')! Note also that for attached volumes a detach/re-attach cycle is needed for QoS changes to take effect.
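
As an example, converting an existing standard volume to the high-IOPS type is then a single retype call (no data is moved, since both types live on the same Ceph backend; the volume ID is a placeholder):

# cinder retype <volume_id> high-iops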

In order to manage the initial default quotas for these two (and the other five volume types the service offers), we use Cinder's support for quota classes. As all volume types apart from std-iops are only available on request, their initial quota is usually set to '0'. So, in order to create the default quotas for a new type, we update the default quota class by running a command like:

# cinder type-create new-type
# cinder quota-class-update --volume-type new-type --volumes 0 --snapshots 0 --gigabytes 0 default

Of course, this method can also be used to define non-zero initial quotas for new volume types; in any case, it avoids having to set the initial quotas explicitly after each project creation.
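
When access to one of the restricted types is then granted upon request, the per-project quota is raised explicitly; a sketch (the project ID and the values are placeholders):

# cinder quota-update --volumes 10 --gigabytes 1000 --volume-type high-iops <project_id>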


Fixing a long-standing issue: Request timeouts and DB deadlocks


For quite some time, our Cinder deployment had suffered from request timeouts that left volumes in error states when parallel deletions were performed. Though easily reproducible, the problem occurred only infrequently (and subsequently received the corresponding attention ...). Recently, however, it became a much more severe issue with the increased use of Magnum and Kubernetes clusters (which use volumes and hence launch parallel volume deletions at larger scale when being removed). This affected the overall service availability (and, subsequently, received the corresponding attention here as well ...).

In these situations, the 'c-vol' logs showed lines like

"Deadlock detected when running 'reservation_commit': Retrying ..."

and hence indicated a locking problem. We weren't able to pinpoint in the code how a deadlock could occur, though. A first change that mitigated the situation was to reduce 'innodb_lock_wait_timeout' from its default value of 50 seconds to 1 second: the client became less patient and exercised the retry logic that decorates the database interactions much earlier. Clearly, this did not address the underlying problem, but it at least allowed the service to handle these parallel deletions much more gracefully.
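
The corresponding change is a one-line setting on the MySQL server side; an illustrative 'my.cnf' snippet:

[mysqld]
# have lock waits fail fast so that the retry logic around the database calls kicks in earlier
innodb_lock_wait_timeout = 1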

The real fix, suggested by a community member, was to change a setting we had carried forward since the initial setup of the service: the connection string in 'cinder.conf' did not specify a driver and hence used the MySQL-Python wrapper (rather than the recommended 'pymysql' pure-Python implementation). After changing our connection from

connection = mysql://cinder:<pw>@<host>:<port>/cinder

to

connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder

the problem basically disappeared!

So the underlying reason was the handling of green-thread parallelism in the wrapper vs. the pure-Python implementation: while the former enforces serialisation (and hence eventually deadlocks in SQLAlchemy), the latter allows proper parallel execution of the requests to the database. The OpenStack oslo team is now looking into issuing a warning when it detects this obsolete setting.

As using the 'pymysql' driver is generally recommended and is, for instance, the default in DevStack deployments, volunteers helping with this issue had a really hard time reproducing the problems we experienced ... another lesson learnt when keeping services running over a longer period :)


Tuesday 6 June 2017

OpenStack papers community on Zenodo


At the recent summit in Boston, Doug Hellmann and I were discussing research around OpenStack, covering both the software itself and how it is used by applications. Many papers are published in conference proceedings and PhD theses, but finding out about them can be difficult. While these papers may not necessarily lead to open source code contributions, the results of this research are a valuable resource for the community.

Increasingly, publications are made under Open Access conditions, i.e. free of all restrictions on access. For example, all projects receiving European Union Horizon 2020 funding are required to make sure that any peer-reviewed journal article they publish is openly accessible, free of charge. In a review with the OpenStack scientific working group, open access was also felt to be consistent with OpenStack's open principles of Open Source, Open Design, Open Development and Open Community.

There are a number of different repositories where publications such as these can be made available. The OpenStack scientific working group is evaluating potential approaches, and Zenodo looks like a good candidate: it is already widely used in the research community, it is open source on GitHub, and the application runs on OpenStack in the CERN Data Centre. Preservation of data is one of CERN's key missions, and this is included in the service delivery for Zenodo.

The name Zenodo is derived from Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history.

Accessing the Repository

The list of papers can be seen at https://zenodo.org/communities/openstack-papers. Along with keywords, a dedicated search facility is available within the community so that relevant papers can be found quickly.

Submitting New Papers

Zenodo allows new papers to be submitted for inclusion into the OpenStack Papers repository. There are a number of steps to be performed.

If a paper has already been published elsewhere, please ensure that it is available under open access conditions before submitting it to the repository. Alternatively, if a paper can be published freely, it can be published in Zenodo for the first time and receive a DOI directly.
  1. Log in to Zenodo. This can be done using your GitHub account if you have one, or by registering a new account via the 'Sign Up' button.
  2. Once logged in, you can go to the OpenStack papers repository at https://zenodo.org/communities/openstack-papers and upload a new paper.
  3. The submission will then be verified before publishing.
To submit to this repository, you need to provide:
  • Title of the paper
  • Author list
  • Description (the abstract is often a good choice)
  • Date of publication
If known, please also provide the following:
  • DOI (Digital Object Identifier), used to uniquely identify the object. In general, this will already have been allocated to the paper by the original publisher. If none is specified, Zenodo will create one, which is good for new publications but bad practice for already published works, as it generates duplicate DOIs. So please try to find the original DOI, which also helps with future cross-referencing.
  • There are optional fields at upload time for adding more metadata (to make it machine readable), such as “Journal” and “Conference”. Adding journal information improves the searching and collating of documents in the future, so if this information is known, it is good to enter it.
Zenodo provides synchronisation facilities for repositories to exchange information (OAI 2.0). A Planet OpenStack feed using this would be an interesting enhancement to consider, and adding RSS support to Zenodo would also be a welcome contribution.
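
As an illustration, such a feed could harvest the community's records via Zenodo's OAI-PMH endpoint; the URL below follows Zenodo's documented conventions for community sets, but the exact set name should be treated as an assumption:

# curl "https://zenodo.org/oai2d?verb=ListRecords&metadataPrefix=oai_dc&set=user-openstack-papers"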