Tuesday, 21 October 2014

Kerberos and Single Sign On with OpenStack

External Authentication with Keystone

One of the features most commonly requested by the CERN cloud user community is support for authentication using Kerberos on the command line and single sign-on with the OpenStack dashboard.

In our Windows and Linux environment, we run Active Directory to provide authentication services. During the Essex cycle of OpenStack, we added support for authentication based on Active Directory passwords. However, this had several drawbacks:
  • When using the command line clients, users had the choice of storing their password in environment variables, for example with the local openrc script, or re-typing their password with each OpenStack command. Passwords in environment variables carry significant security risks since they are passed to any sub-command and can be read by the system administrator of the server you are on.
  • When logging in with the web interface, users were entering their password into the dashboard. Most of CERN's applications use a single sign-on package with Active Directory Federation Services (ADFS). Recent problems such as Heartbleed show the risks of entering passwords into web applications.
The following describes how we configured this functionality.

Approach

With our upgrade to Icehouse completed last week and the new release of the v3 identity API, Keystone now supports several authentication mechanisms through plugins. By default, password, token and external authentication are provided. Other authentication methods, such as Kerberos or X.509, can be used with a suitable Apache configuration and the external plugin provided in Keystone. Unfortunately, when enabling these methods in Apache, there is no way to make them optional so that the client can choose the most appropriate one.

Also, when checking which projects they can access, the client normally performs two operations against Keystone: one to retrieve a token, and a second one, using that token, to retrieve the project list. Even if the endpoint is specified in the environment variables, the second call always uses the service catalog, so if the catalog advertises version 2 and we are using version 3, the API call raises an exception.
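
To illustrate, here is a minimal sketch of those two calls against the raw Identity v3 REST API (the endpoint, user, password and domain are placeholders, and error handling is omitted):

import json
import requests

AUTH_URL = 'https://mykeystone/main/v3'

# Call 1: authenticate with a password and obtain an unscoped token.
body = {'auth': {'identity': {'methods': ['password'],
                              'password': {'user': {'name': 'myuser',
                                                    'domain': {'id': 'default'},
                                                    'password': 'mypassword'}}}}}
resp = requests.post(AUTH_URL + '/auth/tokens',
                     data=json.dumps(body),
                     headers={'Content-Type': 'application/json'})
token = resp.headers['X-Subject-Token']

# Call 2: list the projects the user can access. In the client this endpoint
# comes from the service catalog, so a v2-only catalog entry makes this call
# fail even though the first one succeeded.
projects = requests.get(AUTH_URL + '/auth/projects',
                        headers={'X-Auth-Token': token}).json()
print([p['name'] for p in projects['projects']])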

Requirements

In this case we need a solution that allows us to use Kerberos, X.509 or another authentication mechanism in a transparent and backwards-compatible way, so we can offer both APIs and let users choose the one most appropriate for their workflow. This will allow us to migrate services from one API version to the next with no downtime.

In order to allow external authentication for our clients, we need to cover two parts: the client side, to select which auth plugin to use, and the server side, to allow multiple auth methods and API versions at once.

Server Solution

In order to have different entry points under the same API, we need a load balancer; in this particular case we use HAProxy. From this load balancer we call two different sets of backend machines, one for version 2 of the API and the other for version 3. In the load balancer we can inspect the version in the URL the client connects to and route the request to the appropriate set. Each backend runs Keystone under Apache and is connected to the same database. We need this so that tokens can be validated no matter which version the client uses. The only difference between the backend sets is the catalog: the identity service entry differs on each, pointing the client to the version available on that set. For this purpose we use a templated catalog.


This solves the multi-version issue of the OpenStack environment, but it does not yet allow Kerberos or X.509. As these methods are not optional, we need a different entry point for each authentication plugin used: one for standard OpenStack authentication (password, token), one for Kerberos and one for X.509. There is no issue with the catalog if we enable these methods; all of them can be registered in the service catalog like normal OpenStack authentication, because any subsequent call on the system will use token-based authentication.
So in the Apache v3 backend we have the following URLs defined:

https://mykeystone/main/v3
https://mykeystone/admin/v3
https://mykeystone/krb/v3
https://mykeystone/x509/v3

If you post an authentication request to the Kerberos URL, it will require a valid Kerberos token; if one is not sent, a challenge is initiated. After validating the token, Apache sets the authenticated principal as REMOTE_USER. For client certificate authentication, you use the X.509 URL, which requires a valid certificate; in this case the certificate DN is used as REMOTE_USER. Once this variable is set, Keystone can take over and look the user up in the Keystone database.
There is a small caveat: we cannot offload SSL client authentication on HAProxy, so for this purpose the client uses a different port, 8443, which connects directly to the configured backends. For X.509 authentication we therefore use 'https://mykeystone:8443/x509/v3'.
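
As an illustration (not the production client code), the token request against this endpoint can be made with the v3 'external' authentication method while presenting a client certificate; the certificate, key and CA paths below are placeholders:

import json
import requests

AUTH_URL = 'https://mykeystone:8443/x509/v3'

# The "external" method tells Keystone to trust the REMOTE_USER set by Apache,
# which here is derived from the client certificate DN.
body = {'auth': {'identity': {'methods': ['external'], 'external': {}}}}

resp = requests.post(AUTH_URL + '/auth/tokens',
                     data=json.dumps(body),
                     headers={'Content-Type': 'application/json'},
                     cert=('/path/to/usercert.pem', '/path/to/userkey.pem'),
                     verify='/path/to/ca.pem')
print(resp.headers['X-Subject-Token'])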

Client Solution

For the client side, the plugin mechanism will only be available in the common CLI (python-openstackclient) and not in the rest of the toolset (nova, glance, cinder, ...). There is no code yet that implements the plugin functionality, so in order to provide a short-term implementation based on our current architecture, we base the selection of the plugin on the OS_AUTH_URL for the moment. The final upstream implementation will almost certainly differ at this point, by using a parameter or by discovering the available auth plugins. In that case the client implementation may change, but it is likely to stay close to this initial one.

In openstackclient/common/clientmanager.py
...
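        # Short-term approach: select the auth plugin from hints in OS_AUTH_URL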
        if 'krb' in auth_url and ver_prefix == 'v3':
            LOG.debug('Using kerberos auth %s', ver_prefix)
            self.auth = v3_auth_kerberos.Kerberos(
                auth_url=auth_url,
                trust_id=trust_id,
                domain_id=domain_id,
                domain_name=domain_name,
                project_id=project_id,
                project_name=project_name,
                project_domain_id=project_domain_id,
                project_domain_name=project_domain_name,
            )
        elif 'x509' in auth_url and ver_prefix == 'v3':
            LOG.debug('Using x509 auth %s', ver_prefix)
            self.auth = v3_auth_x509.X509(
                auth_url=auth_url,
                trust_id=trust_id,
                domain_id=domain_id,
                domain_name=domain_name,
                project_id=project_id,
                project_name=project_name,
                project_domain_id=project_domain_id,
                project_domain_name=project_domain_name,
                client_cert=client_cert,
            )
        elif self._url:
...

HAproxy configuration

global
  chroot  /var/lib/haproxy
  daemon
  group  haproxy
  log  mysyslogserver local0
  maxconn  8000
  pidfile  /var/run/haproxy.pid
  ssl-default-bind-ciphers  ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128:AES256:AES:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK
  stats  socket /var/lib/haproxy/stats
  tune.ssl.default-dh-param  2048
  user  haproxy

defaults
  log  global
  maxconn  8000
  mode  http
  option  redispatch
  option  http-server-close
  option  contstats
  retries  3
  stats  enable
  timeout  http-request 10s
  timeout  queue 1m
  timeout  connect 10s
  timeout  client 1m
  timeout  server 1m
  timeout  check 10s

frontend cloud_identity_api_production
  bind 188.184.148.158:443 ssl no-sslv3 crt /etc/haproxy/cert.pem verify none
  acl  v2_acl_admin url_beg /admin/v2
  acl  v2_acl_main url_beg /main/v2
  default_backend  cloud_identity_api_v3_production
  timeout  http-request 5m
  timeout  client 5m
  use_backend  cloud_identity_api_v2_production if v2_acl_admin
  use_backend  cloud_identity_api_v2_production if v2_acl_main

frontend cloud_identity_api_x509_production
  bind 188.184.148.158:8443 ssl no-sslv3 crt /etc/haproxy/cert.pem ca-file /etc/haproxy/ca.pem verify required
  default_backend  cloud_identity_api_v3_production
  rspadd  Strict-Transport-Security:\ max-age=15768000
  timeout  http-request 5m
  timeout  client 5m
  use_backend  cloud_identity_api_v3_production if { ssl_fc_has_crt }

backend cloud_identity_api_v2_production
  balance  roundrobin
  stick  on src
  stick-table  type ip size 20k peers cloud_identity_frontend_production
  timeout  server 5m
  timeout  queue 5m
  timeout  connect 5m
  server cci-keystone-bck01 128.142.132.22:443 check ssl verify none
  server cci-keystone-bck02 188.184.149.124:443 check ssl verify none
  server p01001453s11625 128.142.174.37:443 check ssl verify none

backend cloud_identity_api_v3_production
  balance  roundrobin
  http-request  set-header X-SSL-Client-CN %{+Q}[ssl_c_s_dn(cn)]
  stick  on src
  stick-table  type ip size 20k peers cloud_identity_frontend_production
  timeout  server 5m
  timeout  queue 5m
  timeout  connect 5m
  server cci-keystone-bck03 128.142.159.38:443 check ssl verify none
  server cci-keystone-bck04 128.142.164.244:443 check ssl verify none
  server cci-keystone-bck05 128.142.132.192:443 check ssl verify none
  server cci-keystone-bck06 128.142.146.182:443 check ssl verify none

listen stats
  bind 188.184.148.158:8080
  stats  uri /
  stats  auth haproxy:toto1TOTO$

peers cloud_identity_frontend_production
  peer cci-keystone-load01.cern.ch 188.184.148.158:7777
  peer cci-keystone-load02.cern.ch 128.142.153.203:7777
  peer p01001464675431.cern.ch 128.142.190.8:7777

Apache configuration

WSGISocketPrefix /var/run/wsgi

Listen 443

<VirtualHost *:443>
  ServerName keystone.cern.ch
  DocumentRoot /var/www/cgi-bin/keystone
  LimitRequestFieldSize 65535

  SSLEngine On
  SSLCertificateFile      /etc/keystone/ssl/certs/hostcert.pem
  SSLCertificateKeyFile   /etc/keystone/ssl/keys/hostkey.pem
  SSLCertificateChainFile /etc/keystone/ssl/certs/ca.pem
  SSLCACertificateFile    /etc/keystone/ssl/certs/ca.pem
  SSLVerifyClient         none
  SSLOptions              +StdEnvVars
  SSLVerifyDepth          10
  SSLUserName             SSL_CLIENT_S_DN_CN
  SSLProtocol             all -SSLv2 -SSLv3

  SSLCipherSuite          ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128:AES256:AES:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK
  SSLHonorCipherOrder     on
  Header add Strict-Transport-Security "max-age=15768000"


  WSGIDaemonProcess keystone user=keystone group=keystone processes=2 threads=2
  WSGIProcessGroup keystone

  WSGIScriptAlias /admin /var/www/cgi-bin/keystone/admin
  <Location "/admin">
    SSLRequireSSL
    SSLVerifyClient       none
  </Location>

  WSGIScriptAlias /main /var/www/cgi-bin/keystone/main
  <Location "/main">
    SSLRequireSSL
    SSLVerifyClient       none
  </Location>

  WSGIScriptAlias /krb /var/www/cgi-bin/keystone/main

  <Location "/krb">
    SSLRequireSSL
    SSLVerifyClient       none
  </Location>

  <Location "/krb/v3/auth/tokens">
    SSLRequireSSL
    SSLVerifyClient       none
    AuthType              Kerberos
    AuthName              "Kerberos Login"
    KrbMethodNegotiate    On
    KrbMethodK5Passwd     Off
    KrbServiceName        Any
    KrbAuthRealms         CERN.CH
    Krb5KeyTab            /etc/httpd/http.keytab
    KrbVerifyKDC          Off
    KrbLocalUserMapping   On
    KrbAuthoritative      On
    Require valid-user
  </Location>

  WSGIScriptAlias /x509 /var/www/cgi-bin/keystone/main

  <Location "/x509">
    Order allow,deny
    Allow from all
  </Location>

  WSGIScriptAliasMatch ^(/main/v3/OS-FEDERATION/identity_providers/.*?/protocols/.*?/auth)$ /var/www/cgi-bin/keystone/main/$1

  <LocationMatch /main/v3/OS-FEDERATION/identity_providers/.*?/protocols/saml2/auth>
    ShibRequestSetting requireSession 1
    AuthType shibboleth
    ShibRequireSession On
    ShibRequireAll On
    ShibExportAssertion Off
    Require valid-user
  </LocationMatch>

  <LocationMatch /main/v3/OS-FEDERATION/websso>
    ShibRequestSetting requireSession 1
    AuthType shibboleth
    ShibRequireSession On
    ShibRequireAll On
    ShibExportAssertion Off
    Require valid-user
  </LocationMatch>

  <Location /Shibboleth.sso>
    SetHandler shib
  </Location>

  <Directory /var/www/cgi-bin/keystone>
    Options FollowSymLinks
    AllowOverride All
    Order allow,deny
    Allow from all
  </Directory>
</VirtualHost>

References

The code of python-openstackclient, as well as the python-keystoneclient code that we are using for this implementation, is available at:


We will be working with the community in the Paris summit to find the best way to integrate this functionality into the standard OpenStack release.

Credits

The main author is Jose Castro Leon with help from Marek Denis.

Many thanks to the Keystone core team for their help and advice on the implementation.

Saturday, 19 July 2014

OpenStack plays Tetris : Stacking and Spreading a full private cloud

At CERN, we're running a large scale private cloud which is providing compute resources for physicists analysing the data from the Large Hadron Collider. With hundreds of VMs created per day, the OpenStack scheduler has to perform a Tetris-like job to assign the different flavors of VMs to specific hypervisors.

As we increase the number of VMs that we're running on the CERN cloud, we see the impact of a number of configuration choices made early on in the cloud deployment. One key choice is how to schedule VMs across a pool of hypervisors.

We provide our users with a mixture of flavors for their VMs (for details, see http://openstack-in-production.blogspot.fr/2013/08/flavors-english-perspective.html).

During the past year in production, we have seen a steady growth in the number of instances to nearly 7,000.


At the same time, we're seeing an increasing elastic load as the user community explores potential ways of using clouds for physics.



Given that CERN has a fixed resource pool and the budget available is defined and fixed, the underlying capacity is not elastic and we are now starting to encounter scenarios where the private cloud can become full. Users see this as errors when they request VMs and no free hypervisor can be located.

This situation occurs more frequently for the large VMs. Physics programs can make use of multiple cores to process physics events in parallel, and our batch system (which runs on VMs) benefits from a smaller number of hosts. This accounts for a significant number of large-core VMs.


The problem occurs as the cloud approaches being full. Using the default OpenStack configuration (known as 'spread'), VMs are evenly distributed across the hypervisors. If the cloud is running at low utilisation, this is an attractive configuration as CPU and I/O load are also spread and little hardware is left idle.

However, as the utilisation of the cloud increases, the resources free on each hypervisor are reduced evenly. To take a simple case, consider a cloud with two compute nodes of 24 cores each, handling a variety of flavors. If there are requests for two 1-core VMs followed by one 24-core flavor, the alternative approaches can be simulated.

In a spread configuration,
  • The first VM request lands on hypervisor A leaving A with 23 cores available and B with 24 cores
  • The second VM request arrives and following the policy to spread the usage, this is scheduled to hypervisor B, leaving A and B with 23 cores available.
  • The request for one 24 core flavor arrives and no hypervisor can satisfy it despite there being 46 cores available and only 4% of the cloud used.
In the stacked configuration,

  • The first VM request lands on hypervisor A leaving A with 23 cores available and B with 24 cores
  • The second VM request arrives and following the policy to stack the usage, this is scheduled to hypervisor A, leaving A with 22 cores and B with 24 cores available.
  • The request for one 24 core flavor arrives and is satisfied by B
A stacked configuration is achieved by making the RAM weight negative (i.e. prefer machines with less free RAM), which has the effect of packing the VMs. This is done through a nova.conf setting as follows:

ram_weight_multiplier=-1.0
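
To make the effect concrete, here is a toy simulation (not the actual nova scheduler code) of a RAM weigher with positive and negative multipliers; it mirrors the two-node example above but works on free RAM, which is what ram_weight_multiplier acts on:

def pick_host(free_ram, multiplier):
    # The weigher scores each host as free RAM times the multiplier and the
    # scheduler picks the highest score: positive spreads, negative stacks.
    return max(free_ram, key=lambda host: free_ram[host] * multiplier)

def simulate(multiplier):
    free_ram = {'A': 48 * 1024, 'B': 48 * 1024}   # two hypervisors, 48 GB each (in MB)
    for small_vm in (2048, 2048):                 # two small VMs arrive first
        host = pick_host(free_ram, multiplier)
        free_ram[host] -= small_vm
        print('small VM -> host %s' % host)
    big_vm = 48 * 1024                            # then a VM needing a whole node
    fits = [h for h in free_ram if free_ram[h] >= big_vm]
    print('whole-node VM fits on: %s' % (', '.join(fits) if fits else 'no host'))

print('spread  (ram_weight_multiplier=1.0):')
simulate(1.0)
print('stacked (ram_weight_multiplier=-1.0):')
simulate(-1.0)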


When a cloud is initially being set up, the question of maximum packing does not often come up. However, once the cloud has workload running under spread, it can be disruptive to move to stacked since the existing VMs will not be moved to match the new policy.

Thus, it is important as part of the cloud planning to reflect on the best approach for each different cloud use case and avoid more complex resource rebalancing at a later date.

References

  • OpenStack configuration reference for scheduling at http://docs.openstack.org/trunk/config-reference/content/section_compute-scheduler.html



Thursday, 20 March 2014

CERN Cloud Architecture - Update

In the last OpenStack Design Summit in Hong Kong I presented the CERN Cloud Architecture in the talk “Deep Dive into the CERN Cloud Infrastructure” https://www.openstack.org/summit/openstack-summit-hong-kong-2013/session-videos/presentation/deep-dive-into-the-cern-cloud-infrastructure . Since then the infrastructure has grown to a third cell and we have enabled the ceilometer compute-agent. Because of that, we needed to make some architecture changes to cope with the number of nova-api calls that the ceilometer compute-agent generates.

The cloud infrastructure now has more than 50,000 cores and, after the next hardware deliveries expected during the coming months, more than 35,000 new cores will be added, most of them in the remote Computer Centre in Hungary. We also continue to migrate existing servers to OpenStack compute nodes at an average of 100 servers per week.

Fig. 1 – High-level view of CERN Cloud Infrastructure
We are using Cells in order to scale the infrastructure and for project distribution.
At the moment we have three Compute Cells. Two are deployed in Geneva, Switzerland and the other in Budapest, Hungary.

All OpenStack services running in the Cell Controllers are behind a Load Balancer and we have at least 3 running instances for each of them. As message broker we are using RabbitMQ clustered with HA queues.

Metering is an important requirement for CERN to account for resources, and ceilometer is the obvious solution to provide this functionality.
Considering our Cell setup, we run the ceilometer api and ceilometer collector in the API Cell, and the ceilometer agent-central and ceilometer collector in the Compute Cells.


Fig. 2 – OpenStack services running in the different Cell layers of the CERN Cloud infrastructure. In green, the newly configured components.
In order to get information about the running VMs, the ceilometer compute-agent calls nova-api. In our initial setup we took the simple approach of using the nova-api already running in the API Cell. This means all ceilometer compute-agents authenticate with keystone in the API Cell and then call the nova-api running there. Unfortunately this approach doesn't work with cells because of this bug: https://bugs.launchpad.net/nova/+bug/1211022
Even though ceilometer was not getting the right instance domain and failed to find the VMs on the compute nodes, we noticed a huge increase in the number of nova-api calls hitting the nova-api servers in the API Cell, which could degrade the user experience.

We then decided to move this load to the Compute Cells by enabling nova-api compute there and pointing all ceilometer compute-agents to it instead. This approach has three main advantages:
1) Isolation of nova-api calls per Cell, allowing better dimensioning of the Compute Cell controllers and separation between user and ceilometer requests.
2) nova-api on the Compute Cells uses the nova Cell databases, distributing the queries between databases and not overloading the API Cell database.
3) Because of point 2), the VM domain name is now reported correctly.

However, to deploy nova-api compute at the Compute Cell level we also needed to configure other components: keystone, glance-api and glance-registry.

- Keystone is configured per Compute Cell with the following endpoints (local nova-api and local glance-api). Only service accounts can authenticate with the keystones running in the Compute Cells; users are not allowed.
Configuring keystone per Compute Cell allows us to distribute the nova-api load at the Cell level. For ceilometer, only the Cell databases are used to retrieve instance information. The keystone load is also distributed: instead of using the API Cell keystone used by every user, ceilometer only uses the Compute Cell keystones, which are completely isolated from the API Cell. This is especially important because we are not using PKI and our keystone configuration is single-threaded.

- From the beginning we have been running glance-api at the Compute Cell level. This allows us to have an image cache in the Compute Cells, which is especially important for the Budapest Computer Centre since the Ceph deployment is in Geneva.
The ceilometer compute-agent also queries for image information using nova-api. However, we can't use the existing glance-api because it uses the API Cell keystone for token validation. Because of that, we set up another glance-api and glance-registry at the Compute Cell level, listening on a different port and using the local keystone.

- Nova-api compute is enabled at the Compute Cell level. All nova services running on the Compute Cell controllers use the same configuration file, “nova.conf”. This means that for the nova-api service we needed to override the “glance_api_servers” configuration option to point to the new local glance-api, while keeping the old configuration that is necessary to spawn instances. The nova-api service uses the local keystone. The metadata service is not affected by this.

We expect that with this distribution, if ceilometer starts to overload the infrastructure, the user experience will not be affected.
With all these changes in the architecture, we have now enabled the ceilometer compute-agent in all Compute Cells.

Just out of curiosity, I would like to finish this blog post with a plot showing the number of nova-api calls before and after enabling the ceilometer compute-agent across the whole infrastructure. In total it increased by more than 14 times.




Friday, 7 March 2014

Enable Cinder-multi-backend with an existing Ceph backend.


CERN IT operates a 3 PB Ceph cluster, and one of our use cases is to store our OpenStack volumes and images. For more details on the Ceph cluster, Dan van der Ster's presentation is available at the following link.

After the migration to Havana, we started to provide the volume service to a wider audience and therefore needed to tune our cinder configuration.

This post shows how we enabled multi-backend support in Cinder, how we migrated our existing Ceph volumes to the new volume type, and finally the quality of service we want to enable on the newly created volume type.

We already had Ceph configured as the default backend:
[DEFAULT]
...
quota_volumes=0
quota_snapshots=0
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_user=volumes
rbd_pool=volumes
rbd_secret_uuid=00000000-1111-2222-3333-000000000001

We added the following options to /etc/cinder/cinder.conf to enable multi-backend support:

[DEFAULT]
...
enabled_backends=standard
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_volume_type=standard
[standard]
volume_group=standard
rbd_user=volumes
rbd_pool=volumes
volume_backend_name=standard
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_secret_uuid=00000000-1111-2222-3333-000000000001

Create the volume type:
# cinder type-create standard
# cinder type-key standard set volume_backend_name=standard

To verify the type has been created you can run:

# cinder extra-specs-list
+--------------------------------------+----------+---------------------------------------+
|                  ID                  |   Name   |              extra_specs              |
+--------------------------------------+----------+---------------------------------------+
| c6ad034a-5d97-443b-97c6-58a8744bf99b | standard | {u'volume_backend_name': u'standard'} |
+--------------------------------------+----------+---------------------------------------+

A restart is needed after these changes:

# for i in volume api scheduler; do service openstack-cinder-$i restart ; done

At this point it is not possible to attach/detach the volumes that were created without a type.
Nova will fail with the following error:

nova.openstack.common.notifier.rpc_notifier ValueError: Circular reference detected

To fix it, we had to update the database manually, in two steps:
mysqldump cinder first (don't forget :)
  • Update the volume_type_id column with the ID from the output of cinder extra-specs-list:
update volumes set volume_type_id="c6ad034a-5d97-443b-97c6-58a8744bf99b" where volume_type_id is NULL; 
  • For each controller the host column needs an update:
update volumes set  host='p01@standard' where host='p01';
update volumes set  host='p02@standard' where host='p02'; 

All volumes are now of type standard and can be operated as usual. Be sure to have "default_volume_type" defined in your cinder.conf, otherwise it will default to 'None' and those volumes will not be functional.

The last step is to delete the old volume settings from your DEFAULT section.

Since we have a large number of disks in the Ceph store, there is a very high potential IOPS capacity, but we want to be sure that individual VMs cannot monopolise it. This involves enabling the Quality-of-Service features in Cinder.

Enabling QoS is straightforward:
# cinder qos-create standard-iops consumer="front-end" read_iops_sec=400 write_iops_sec=200
# cinder qos-associate 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 c6ad034a-5d97-443b-97c6-58a8744bf99b
If you want to add additional limits:
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set read_bytes_sec=80000000
# cinder qos-key 10a7b93c-38d7-4061-bfb8-78d01e2fe6d8 set write_bytes_sec=40000000
# cinder qos-list
For the QoS parameters to be activated on an existing volume, you need to detach and reattach it, as in the sketch below.
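
As a sketch of that step, assuming the Havana-era python-novaclient v1_1 API (the server ID, volume ID, device name and auth details are placeholders):

from novaclient.v1_1 import client

nova = client.Client('myuser', 'mypassword', 'myproject',
                     'https://mykeystone/main/v2.0')

server_id = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa'   # placeholder server UUID
volume_id = 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'   # placeholder volume UUID

# Detach the volume, wait until it shows as 'available' in cinder, then
# re-attach it so the new QoS limits apply to the attachment.
nova.volumes.delete_server_volume(server_id, volume_id)
# ... poll the volume status here before re-attaching ...
nova.volumes.create_server_volume(server_id, volume_id, '/dev/vdb')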

This information was collected from the CERN cloud team, and a big thank you to the Ceph team for their help.

Monday, 24 February 2014

Our Cloud in Havana


TL;DR Upgrading a nearly 50,000 core cloud from Grizzly to Havana can be done with a series of steps, each of which can have short periods of reduced functionality but with constant VM availability.

At CERN, we started our production cloud service on Grizzly in July 2013. The previous OpenStack clouds had been pre-production environments with a fixed lifetime (i.e. they were available for use with an end date to be announced, at which point users would move to the new version by re-creating instances with tools such as Puppet or by snapshotting and re-uploading them).

With the Grizzly release, we made the service available with an agreement to upgrade in place rather than build anew. This blog details our experiences moving to the Havana release.

High Level Approach

We took a rolling upgrade approach, component by component, depending on need and complexity.
Following the release notes and operations guide, the order we chose was the following
  1. Ceilometer
  2. Glance
  3. Keystone
  4. Cinder
  5. Client CLIs
  6. Horizon
  7. Nova
The CERN cloud is based on the RDO distribution (see http://openstack.redhat.org). The majority of the servers are running Scientific Linux 6.5 with hypervisors running KVM. We follow a multi-hypervisor approach so we also have Windows 2012 R2 with Hyper-V. The cloud databases are using MySQL.

Other configurations and different sites may see other issues than listed here so please check that this approach is appropriate for your environment before execution.

Ceilometer

While the cloud was in production in July, we had problems getting ceilometer to work well with cells and with the number of hypervisors we have (over 2,000). Thus, we chose to upgrade early to Havana as this provided much of the functionality we needed and avoided needing to backport.

Havana Ceilometer worked well with Grizzly Nova and allowed us to progress further with detailed metering of our cloud. We still needed the patch (which was subsequently included in the Havana 2013.2.2 stable release, after we upgraded).

Glance

The CERN Glance environment is backed by Ceph. We run multiple glance servers behind an HA Proxy load balancer.

For the upgrade, we
  • Stopped the Glance service at 16h00
  • Performed the database upgrade steps
  • Installed the Havana packages in all top and cell controllers
  • Glance was re-enabled at 16h45
One issue was spotted where access to non-public images was possible when using the nova image-list command. The images were not visible when using glance image-list. As a mitigation, access to the Nova image API was blocked and users were told to use the glance command (as was already recommended).

The root cause was related to the bug https://bugs.launchpad.net/glance/+bug/1152716. The fix was to add the policy statement for the context_is_admin parameter, which is needed to limit access to images for projects.

Keystone

The overall approach taken was to try an online upgrade. We have a keystone in each of our cells (so we can talk to each of them independently of the top level cell API nodes) so these were good candidates to upgrade first. We use Active Directory for the user credentials so the Keystone service itself has a limited amount of state only (related to EC2 credentials and token management).

There was some significant additional functionality, such as tokens based on PKI (to allow validating tokens without calling home) and the V3 API, which adds lots of features we are interested in such as Domains. We chose to take the incremental, small-step approach of migrating with the same functionality and then enabling additional features once the code had been deployed.

For the main keystone instance, we were aiming to benefit from the load balancing layer and from the fact that there were minimal database changes in Keystone. The largest problem foreseen was the EC2 credentials since, in Grizzly, these are stored in a different column from those in Havana (just Credentials). These columns had to be kept in sync with a script during the phase where both versions were running.

In practice, we performed the upgrade more rapidly than originally planned, as some of the clients gave errors when they had received a token from a Grizzly keystone and were authenticating against a Havana one. We therefore upgraded all the keystone servers as soon as the final functional tests in the production environment were completed.

The detailed error message on a Grizzly client was:

2014-01-23 09:10:02    ERROR [root] 'token'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/keystone/common/wsgi.py", line 265, in __call__
    result = method(context, **params)
  File "/usr/lib/python2.6/site-packages/keystone/token/controllers.py", line 541, in validate_token
    self._assert_default_domain(context, token_ref)
  File "/usr/lib/python2.6/site-packages/keystone/token/controllers.py", line 482, in _assert_default_domain
    if (token_ref['token_data']['token']['user']['domain']['id'] !=
KeyError: 'token'

The solution was to complete the upgrade to Keystone on all the servers.

During the upgrade, we also noticed database growth from expired tokens that had not been purged over the past few months, which is now possible to manage using keystone-manage token-flush (https://blueprints.launchpad.net/keystone/+spec/keystone-manage-token-flush-periodically). This is not a specific upgrade problem but it is worth doing to keep the database manageable.

Cinder

We had backported some Cinder cells functionality to Grizzly so that we could launch our Ceph volume storage function before we migrated to Havana. CERN currently runs a 3.5 PB Ceph service including OpenStack images, OpenStack Volumes and several other projects working with the object and block interfaces into Ceph. We had to take a careful migration approach for Cinder to protect the additional columns which were not in the standard Grizzly tables.

The impact for the user community was that volume creation/deletion was not possible for around 15 minutes during the upgrade. Existing VMs with attached storage continued to run without problems.

Following the standard upgrade procedure of
  • save a list of all volumes before starting using nova volume-list --all-t
  • stop the daemons for cinder
  • Perform full DB backup of cinder
  • update all RPMs
  • cinder-manage db sync
  • restart daemons
  • check new nova volume-list --all-t to see it matches
One minor issue we found was on the client side. Since the nova client depends on the cinder client, we had to upgrade the nova client to get to the latest cinder client version. With the nova client package being backwards compatible, this was not an issue but forced an earlier than planned update of nova client.


Client CLIs

We use the RDO client packages on our Linux machines. The upgrade was a change of repository and yum update.

Likewise, the Windows and Mac clients were upgraded in a similar way using pip. 

Horizon

We have customised Horizon slightly to add some basic features (add help and subscribe buttons to the login page, disable some features which aren't currently supported on the CERN cloud such as security groups). These changes were repeated.

The migration approach was to set up the Horizon web server on a new machine and test out the functionality. Horizon is very tolerant of different versions of code, including partial upgrades such as the CERN environment with Nova downlevel. We found a minor bug with the date picker when using Firefox which we'll be reporting upstream.

Once validated, the alias and virtual host definition in Apache were changed to point to the new machine.

Once under production load, however, we started to see stability issues with the Apache server and large numbers of connections. Users saw authentication problems on login and HTTP 500 errors. In the short term, we put in place a regular web server restart to allow us to analyse the problem. With memcached looking after the session information, this was a workaround we could use without user disruption, but it is not a long-term solution (from the comments below, this is a reported bug at https://bugs.launchpad.net/python-novaclient/+bug/1247056).

Nova

The Nova migration testing took the longest time for several reasons.

It is our most complex configuration with many cells in two data centres, in Geneva and Budapest. With nearly 50,000 cores and thousands of hypervisors, there are a lot of machines to update.


We have customised Nova to support the CERN legacy network management system. This is a CERN-specific database which contains the hostnames, MAC addresses and IPs, and it is kept up to date by the nova network component.

To test, we performed an offline database copy and stepped through the migration scripts in a cloned environment. The only problems encountered were due to the local tables we had added. We performed a functional and stress test of the environment. With the dynamic nature of the cloud at CERN, we wanted to be sure that there would not be regression in areas such as performance and stability.

During the testing, we found some minor problems
  • euca2ools did not work on our standard SL 6 configuration (version 2.1.4). An error "instance-type should be of type string" was raised on VM creation. The root cause was a downlevel boto version (see bugzilla ticket)
  • When creating multiple machines using --num-instances on the nova command line with the cells configuration, the unique name created was invalid. A launchpad report was raised but this was a non-blocking issue as the feature is not used often.
We took a very conservative approach to ensure clear steps and post step checkout validation. We had investigated doing the migration cell by cell but we took a cautious approach for our first upgrade. For the actual migration, the steps were as follows:
  • Disable Puppet automatic software updating so we would be in control of which upgrades occurred when.
  • Move the Havana repository from the test/QA environment to the master branch in Puppet
  • Run "yum clean --all" on all the nodes with mcollective 
  • Stop Puppet running on all nodes
  • Starting with the top API controllers, then the cell controllers and finally the compute nodes
    • Block requests to stop new operations arriving
    • Disable automatic restart of daemons by the monitoring system
    • Stop the daemons and disable RabbitMQ
  • Back up all DBs and create a local dump of each in case we need a rapid restore
  • Update all the Nova RPMs on top controllers
  • Update the top DB
  • Reboot to get the latest Linux kernel
  • Update all the Nova RPMs on the cell controller
  • Update the cell DB
  • Reboot for new kernel
  • Update the RPMs on the compute nodes
  • Enable RabbitMQ
  • Enable services on top controllers
  • Enable services in child cell controllers
  • Enable nova-compute
  • Check out the service communications
  • Update the compute nodes using mcollective
  • Enable monitoring/exceptions on all nodes
  • Enable Puppet on all nodes
  • Perform check out tests
  • Enable user facing APIs at the top cells
Total time to run these steps was around 6 hours, with the VMs running throughout. The database backups, cloning and final migration script testing on the production data took a couple of hours. The software upgrade steps were also significant; checking that RPMs are deployed across thousands of hypervisors is a lengthy process. Although there were a lot of steps, each one could be performed and checked out before continuing.

Post Production Issues

  • The temporary table containing quota cache data was cleaned as part of the upgrade but does not appear to be 100% recreated. https://bugs.launchpad.net/nova/+bug/1245746 seems to describe the problem. Logging in with the dashboard fixes the problem in most cases.

Conclusions

Upgrading OpenStack in production needs some careful planning and testing but it is a set of standard upgrade steps. There are lots of interesting features in Havana to explore for both the operations team and the end users of the CERN cloud.

Credits

This information was collected from the CERN cloud team which performed the upgrade (Belmiro, Jose, Luis, Marcos, Thomas, Stefano) while also providing user support and adding new hardware.