Due to its flexibility, the volume concept has become very popular with users and the service has hence grown during the past years to over 1PB of allocated quota, hosted in more than 4'000 volumes. In this post, we'd like to share some of the features we use and point out some of the pitfalls we've run into when running (a very stable and easy to maintain) Cinder service in production.
Avoiding sadness: Understanding the 'host' parameter
With the intent to increase the resiliency, we configured the service from the start to run on multiple hosts. The three controller nodes were set up in an identical way, so all of them ran the API ('c-api'), scheduler ('c-sched') and volume ('c-vol') services.
With the first upgrades, however, we realised that there was a coupling between a volume and the 'c-vol' service that had created it: each volume is associated with its creation host which, by default, is identified by hostname of the controller. So, when the first controller needed to be replaced, the 'c-sched' wasn't able to find the original 'c-vol' service which would be able to execute volume operations. At the time, we fixed this by changing the corresponding volume entries in the Cinder database to point to the new host that was added.
As the Cinder configuration allows the 'host' to be set directly in 'cinder.conf', we set this parameter to be the same on all controllers with the idea to remove the coupling between the volume and the specific 'c-vol' which was used to create it. We ran like this for quite a while and although we never saw direct issues related to this setting, in hindsight it may explain some of the issues we had with volumes getting stuck in transitional states. The main problem here is the clean-up being done as the daemons start up: as they assume exclusive access to 'their' volumes, volumes in transient states will be "cleaned up", e.g. their state reset, when a daemon starts, so in a setup with identical 'host's, this may cause undesired interferences.
Taking this into account, our setup has been changed to keep the 'c-api' on all three controller, but run the 'c-vol' and 'c-sched' services on one host only. Closely following the recent work of the Cinder team to improve the locking and allow for Active/Active HA we're looking forward to have Active/Active HA 'c-vol' services fully available again.
Using multiple volume types: QoS and quota classes
The scarce resource on our Ceph backend is not space, but IOPS, and after we handed out the first volumes to users, we quickly realised that some resource management was needed. We achieved this by creating a QoS spec and associating it with the one volume type we had at the time:
# cinder qos-create std-iops write_iops_sec=100 read_iops_sec=100
# cinder qos-associate <std_iops_qos_id> <std_volume_type_id>
This setting does not only allow you to limit the amount of IOPS used on this volume type, but also to define different service levels. For instance, for more demanding use cases we added a high IOPS volume type to which access is granted on a per request basis:
# cinder type-create high-iops
# cinder qos-create high-iops write_iops_sec=500 read_iops_sec=500
# cinder qos-associate <high_iops_qos_id> <high_iops_volume_type_id>
Note that both types are provided by the same backend and physical hardware (which also allows for a conversion without data movement between these types using 'cinder retype')! Note also that for attached volumes a detach/re-attach cycle is needed to have QoS changes taking effect.
In order to manage the initial default quotas for these two (and the other five volume types the service offers), we use Cinder's support for quota classes. As apart from the std-iops volume type all other volume types are only available on request, the initial quota is usually set to '0'. So, in order to create the default quotas for a new type, we would hence update the default quota class by running a command like:
# cinder type-create new-type
# cinder quota-class-update --volume-type new-type --volumes 0 --snapshots 0 --gigabytes 0 default
Of course, this method can also be used to define different initial quotas for new volume types, but it is in any case a way to avoid setting the initial quotas explicitly after project creation.
Fixing a long-standing issue: Request timeouts and DB deadlocks
For quite some time, our Cinder deployment had suffered from request timeouts leading to volumes left in error states when doing parallel deletions. Though easily reproducible, this was infrequent (and subsequently received the corresponding attention ...). Recently, however, this became a much more severe issue with the increased use of Magnum and Kubernetes clusters (which use volumes and hence launch parallel volumes deletions at larger scale when being removed). This affected the overall service availability (and, subsequently, received the corresponding attention here as well ...).
In this situations, the 'c-vol' logs showed lines like
"Deadlock detected when running 'reservation_commit': Retrying ..."
and hence indicated locking problem. We weren't able to pinpoint in the code how a deadlock would occur, though. A first change that mitigated the situation was to reduce the 'innodb_lock_wait_timeout' from its default value of 50 seconds to 1 second: the client was less patient and exercised the retry logic decorates the database interactions much earlier. Clearly, this did not address the underlying problem, but at least allowed the service to handle these parallel deletions in a much better way.
The real fix, suggested by a community member, implied to try and change a setting we had carried forward since the initial setup of the service: the connection string in 'cinder.conf' was not specifying a driver and hence using the mysql Python wrapper (rather than the recommended 'pymysql' Python implementation). After changing our connection from
connection = mysql://cinder:<pw>@<host>:<port>/cinder
to
connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder
the problem basically disappeared!
So the underlying reason was the management of the green thread parallelism in the wrapper vs. the native Python implementation: while the former enforces serialisation (and hence eventually deadlocks in SQLAlchemy), the latter allows for proper parallel execution of the requests to the database. The OpenStack oslo team is now looking into issuing a warning when it detects this obsolete setting.
As using the 'pymysql' driver is generally recommended and, for instance, default in devstack deployments, volunteers to help with this issue had a really hard time to reproduce the issues we experienced ... another lesson learnt when keeping services running for a longer period :)