Thursday 20 March 2014

CERN Cloud Architecture - Update

At the last OpenStack Design Summit in Hong Kong I presented the CERN Cloud Architecture in the talk “Deep Dive into the CERN Cloud Infrastructure” https://www.openstack.org/summit/openstack-summit-hong-kong-2013/session-videos/presentation/deep-dive-into-the-cern-cloud-infrastructure . Since then the infrastructure has grown to a third cell and we have enabled the ceilometer compute-agent. As a result, we needed to make some architecture changes to cope with the number of nova-api calls that the ceilometer compute-agent generates.

The cloud infrastructure now has more than 50000 cores and, after the next hardware deliveries expected in the coming months, more than 35000 new cores will be added, most of them in the remote Computer Centre in Hungary. We also continue to migrate existing servers to OpenStack compute nodes at an average rate of 100 servers per week.

Fig. 1 – High-level view of CERN Cloud Infrastructure
We are using Cells to scale the infrastructure and to distribute projects.
At the moment we have three Compute Cells: two are deployed in Geneva, Switzerland and the third in Budapest, Hungary.

All OpenStack services running in the Cell Controllers are behind a Load Balancer and we have at least 3 running instances for each of them. As message broker we are using RabbitMQ clustered with HA queues.

Metering is an important requirement for CERN to account for resources, and ceilometer is the obvious solution to provide this functionality.
Given our Cell setup, we run ceilometer api and ceilometer collector in the API Cell, and ceilometer agent-central and ceilometer collector in the Compute Cells.


Fig. 2 – OpenStack services running in the different Cell layers of the CERN Cloud infrastructure. The newly configured components are shown in green.
In order to get information about the running VMs, the ceilometer compute-agent calls nova-api. In our initial setup we took the simple approach of using the nova-api already running in the API Cell. This means all ceilometer compute-agents authenticate with keystone in the API Cell and then call the nova-api running there. Unfortunately this approach doesn’t work with cells because of this bug: https://bugs.launchpad.net/nova/+bug/1211022
Even though ceilometer was not getting the right instance domain and failed to find the VMs on the compute nodes, we noticed a huge increase in the number of calls hitting the nova-api servers in the API Cell, which could degrade the user experience.

We then decided to move this load to the Compute Cells by enabling nova-api compute there and pointing all ceilometer compute-agents to it instead. This approach has 3 main advantages:
1) Isolation of nova-api calls per Cell, allowing better dimensioning of the Compute Cell controllers and separation between user and ceilometer requests.
2) nova-api on the Compute Cells uses the nova Cell databases. This distributes the queries between databases instead of overloading the API Cell database.
3) Because of point 2), the VM domain name is now reported correctly.
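
To make the first advantage concrete, here is a rough sketch of how the polling load is spread once each Compute Cell answers its own ceilometer requests. The cell names, the number of agents per cell and the one-call-per-agent assumption are all hypothetical and chosen only for illustration; they are not CERN's real figures.

```python
# Hypothetical number of ceilometer compute-agents (one per compute node)
# in each Compute Cell. These are illustrative values, not real CERN numbers.
AGENTS_PER_CELL = {"geneva-1": 1000, "geneva-2": 800, "budapest": 600}


def calls_on_api_cell(agents_per_cell, calls_per_agent=1):
    """Initial setup: every agent calls the nova-api in the API Cell."""
    return sum(agents_per_cell.values()) * calls_per_agent


def calls_on_compute_cells(agents_per_cell, calls_per_agent=1):
    """New setup: every agent calls the nova-api of its own Compute Cell."""
    return {cell: n * calls_per_agent for cell, n in agents_per_cell.items()}


central = calls_on_api_cell(AGENTS_PER_CELL)
distributed = calls_on_compute_cells(AGENTS_PER_CELL)

print("API Cell nova-api, initial setup:", central, "calls per polling cycle")
for cell, calls in distributed.items():
    print("Compute Cell", cell, "nova-api, new setup:", calls, "calls per polling cycle")
```

The total number of calls is the same in both setups. What changes is that none of them reach the nova-api servers used by the users, and each Cell controller only has to be dimensioned for its own compute nodes.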

However, to deploy nova-api compute at Compute Cell level we also needed to configure other components: keystone, glance-api and glance-registry.

- Keystone is configured per Compute Cell with the following endpoints: local nova-api and local glance-api. Only service accounts can authenticate with the keystones running in the Compute Cells; users are not allowed.
Configuring keystone per Compute Cell allows us to distribute the nova-api load at Cell level. For ceilometer, only the Cell databases are used to retrieve instance information. Keystone load is also distributed: instead of using the API Cell keystone that every user uses, ceilometer only uses the Compute Cell keystones, which are completely isolated from the API Cell. This is especially important because we are not using PKI and our keystone configuration is single threaded.

- We have been running glance-api at Compute Cell level from the beginning. This allows us to have an image cache in the Compute Cells, which is especially important for the Budapest Computer Centre since the Ceph deployment is in Geneva.
The ceilometer compute-agent also queries for image information through nova-api. However, we can’t use the existing glance-api because it uses the API Cell keystone for token validation. We therefore set up another glance-api and glance-registry at Compute Cell level, listening on a different port and using the local keystone.

- Nova-api compute is enabled at Compute Cell level. All nova services running in the Compute Cell controllers share the same configuration file “nova.conf”. This means that for the nova-api service we needed to override the “glance_api_servers” configuration option to point to the new local glance-api, while keeping the old value, which is still needed to spawn instances. The nova-api service uses the local keystone. The metadata service is not affected by this.

We expect that with this distribution, user experience will not be affected if ceilometer starts to overload the infrastructure.
With all these changes in the architecture, we have now enabled the ceilometer compute-agent in all Compute Cells.

Just out of curiosity, I would like to finish this blog post with a plot showing the number of nova-api calls before and after enabling the ceilometer compute-agent across the whole infrastructure. In total it increased more than 14 times.




Friday 7 March 2014

Enable Cinder-multi-backend with an existing Ceph backend.


CERN IT operates a 3 petabyte Ceph cluster, and one of our use cases is storing our OpenStack volumes and images. For more details on the Ceph cluster, Dan van der Ster's presentation is available at the following link.

After the migration to Havana, we started to provide the volume service to a wider audience and therefore needed to tune our cinder configuration.

This post shows how we enabled multi-backend support in Cinder, how we migrated our existing Ceph volumes to the new volume type, and finally the quality of service we want to enable on the newly created volume type.

We already had Ceph configured in the DEFAULT section:
[DEFAULT]
...
quota_volumes=0
quota_snapshots=0
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=volumes
rbd_pool=volumes
rbd_secret_uuid=00000000-1111-2222-3333-000000000001

We added the following options to /etc/cinder/cinder.conf to enable multi-backend support:

[DEFAULT]
...
enabled_backends=standard
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_volume_type=standard
[standard]
volume_group=standard
rbd_user=volumes
rbd_pool=volumes
volume_backend_name=standard
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_secret_uuid=00000000-1111-2222-3333-000000000001
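
Before restarting the services it can be useful to check that the file is consistent. The following is a minimal sketch of such a check, not a tool we ship: the function name check_backends and the rules it applies are our own assumptions (every name in enabled_backends has its own section with a volume_driver and a volume_backend_name, and default_volume_type is set). The sample text repeats the options shown above without the "..." placeholder.

```python
import configparser

# Sample mirroring the options shown in this post.
SAMPLE_CONF = """
[DEFAULT]
enabled_backends=standard
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_volume_type=standard

[standard]
volume_group=standard
rbd_user=volumes
rbd_pool=volumes
volume_backend_name=standard
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_secret_uuid=00000000-1111-2222-3333-000000000001
"""


def check_backends(conf_text):
    """Return a list of problems found in a multi-backend cinder.conf.

    Hypothetical helper for illustration only; it is not part of cinder.
    """
    # Treat [DEFAULT] as an ordinary section so that leftover options there
    # are not silently inherited by the backend sections during the check.
    parser = configparser.ConfigParser(interpolation=None,
                                       default_section="__unused__")
    parser.read_string(conf_text)

    problems = []
    enabled = parser.get("DEFAULT", "enabled_backends", fallback="")
    names = [name.strip() for name in enabled.split(",") if name.strip()]
    if not names:
        problems.append("enabled_backends is empty or missing")

    for name in names:
        if not parser.has_section(name):
            problems.append("no [%s] section" % name)
            continue
        for option in ("volume_driver", "volume_backend_name"):
            if not parser.has_option(name, option):
                problems.append("[%s] is missing %s" % (name, option))

    if not parser.get("DEFAULT", "default_volume_type", fallback=""):
        problems.append("default_volume_type is not set")
    return problems


print(check_backends(SAMPLE_CONF))
```

An empty list means the sketch found nothing to complain about; it does not prove that cinder will accept the file.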

Create the volume type:
# cinder type-create standard
# cinder type-key standard set volume_backend_name=standard

To verify the type has been created you can run:

# cinder extra-specs-list
+--------------------------------------+----------+---------------------------------------+
|                  ID                  |   Name   |              extra_specs              |
+--------------------------------------+----------+---------------------------------------+
| c6ad034a-5d97-443b-97c6-58a8744bf99b | standard | {u'volume_backend_name': u'standard'} |
+--------------------------------------+----------+---------------------------------------+

A restart is needed after these changes:

# for i in volume api scheduler; do service openstack-cinder-$i restart ; done

At this point it is not possible to attach or detach volumes that were created without a type.
Nova will fail with the following error:

nova.openstack.common.notifier.rpc_notifier ValueError: Circular reference detected

To fix it, we had to update the database manually. Take a backup first with mysqldump cinder (don't forget :), then apply two steps:
  • Update the volume_type_id column with the ID shown in the output of cinder extra-specs-list:
update volumes set volume_type_id="c6ad034a-5d97-443b-97c6-58a8744bf99b" where volume_type_id is NULL;
  • For each controller, update the host column:
update volumes set host='p01@standard' where host='p01';
update volumes set host='p02@standard' where host='p02';
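
The effect of these two updates can be tried out safely before touching the real database. The sketch below replays them against an in-memory SQLite table. SQLite stands in for MySQL here, the table is reduced to three columns, and the volume IDs vol-1 and vol-2 are made up. Only the type ID, the host names and the backend name come from the statements above.

```python
import sqlite3

# ID of the "standard" volume type, as printed by cinder extra-specs-list.
TYPE_ID = "c6ad034a-5d97-443b-97c6-58a8744bf99b"


def migrate(conn, type_id, backend, hosts):
    """Apply the two manual updates to a (simplified) volumes table."""
    cur = conn.cursor()
    # Step 1: give every volume without a type the new volume type.
    cur.execute(
        "UPDATE volumes SET volume_type_id = ? WHERE volume_type_id IS NULL",
        (type_id,),
    )
    # Step 2: append the backend name to the host of each controller.
    for host in hosts:
        cur.execute(
            "UPDATE volumes SET host = ? WHERE host = ?",
            ("%s@%s" % (host, backend), host),
        )
    conn.commit()


# Simplified stand-in for the cinder "volumes" table (assumption: the real
# table has many more columns, which the updates do not touch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE volumes (id TEXT PRIMARY KEY, volume_type_id TEXT, host TEXT)"
)
conn.executemany(
    "INSERT INTO volumes VALUES (?, ?, ?)",
    [("vol-1", None, "p01"), ("vol-2", None, "p02")],
)

migrate(conn, TYPE_ID, "standard", ["p01", "p02"])
rows = conn.execute(
    "SELECT id, volume_type_id, host FROM volumes ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Running migrate a second time changes nothing, because no row has a NULL type or a bare host name any more.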

All volumes are now of type standard and can be operated as usual. Be sure to define "default_volume_type" in your cinder.conf, otherwise it will default to 'None' and these volumes will not be functional.

The last step is to delete the old volume settings from your DEFAULT section.

Since we have a large number of disks in the Ceph store, there is a very high potential capacity for IOPS, but we want to be sure that individual VMs cannot monopolise this capacity. This involves enabling the Quality-of-Service features in cinder.

Enabling QoS is straightforward:
# cinder qos-create standard-iops consumer="front-end" read_iops_sec=400 write_iops_sec=200
# cinder qos-associate 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 c6ad034a-5d97-443b-97c6-58a8744bf99b
If you want to add additional limits:
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set read_bytes_sec=80000000
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set write_bytes_sec=40000000
# cinder qos-list
For the QoS parameters to take effect on an existing volume, you need to detach and reattach it.
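
To get a feeling for how the IOPS and bandwidth limits interact, here is a small worked example using the values from the commands above. It assumes a deliberately simple model in which the tighter of the two limits wins for a given request size. The real throttling is done by the hypervisor and may behave differently (for example with bursts), so treat the numbers as rough guidance only.

```python
# QoS values taken from the cinder qos-create / qos-key commands above.
QOS = {
    "read_iops_sec": 400,
    "write_iops_sec": 200,
    "read_bytes_sec": 80000000,
    "write_bytes_sec": 40000000,
}


def effective_iops(io_size_bytes, iops_limit, bytes_limit):
    """IOPS a volume can reach for a given request size (simplified model)."""
    return min(iops_limit, bytes_limit / io_size_bytes)


def crossover_io_size(iops_limit, bytes_limit):
    """Request size in bytes above which the bandwidth limit is the tighter one."""
    return bytes_limit / iops_limit


read_crossover = crossover_io_size(QOS["read_iops_sec"], QOS["read_bytes_sec"])
write_crossover = crossover_io_size(QOS["write_iops_sec"], QOS["write_bytes_sec"])
print("read crossover size (bytes):", read_crossover)
print("write crossover size (bytes):", write_crossover)

# Small 4 KiB reads are capped by the IOPS limit,
# large 1 MB reads are capped by the bandwidth limit.
small_reads = effective_iops(4096, QOS["read_iops_sec"], QOS["read_bytes_sec"])
large_reads = effective_iops(1000000, QOS["read_iops_sec"], QOS["read_bytes_sec"])
print("4 KiB reads:", small_reads, "IOPS")
print("1 MB reads:", large_reads, "IOPS")
```

With these values both crossover points fall at 200000 bytes per request: below that size the IOPS limit is reached first, above it the bytes-per-second limit.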

This information was collected by the CERN cloud team, and a big thank you goes to the Ceph team for their help.