LHC Tunnel

LHC Tunnel

Wednesday, 21 February 2018

Maximizing resource utilization with Preemptible Instances


The CERN cloud consists of around 8,500 hypervisors providing over 36,000
virtual machines. These provide the compute resources for both the laboratory's
physics program but also for the organisation's administrative operations such
as paying bills and reserving rooms at the hostel.

The resources themselves are generally ordered once to twice a year with servers being kept for around 5 years. Within the CERN budget, the resource planning teams looks at:
  • The resources required to run the computing services requirements for the CERN laboratory. These are projected using capacity planning trend data and upcoming projects such as video conferencing.
With the installation and commissioning of thousands of servers concurrently
(along with their associated decommissioning 5 years later), there are scenarios
to exploit underutilised servers. Programs such as LHC@Home are used but we have also been interested to expand the cloud to provide virtual machine instances which can be rapidly terminated in the event of
  • Resources being required for IT services as they scale out for events such as a large scale web cast on a popular topic or to provision instances for a new version of an application.
  • Partially full hypervisors where the last remaining cores are not being requested (the Tetris problem).
  • Compute servers at the end of their lifetime which are used to the full before being removed from the computer centre to make room for new deliveries which are more efficient and in warranty.
The characteristics of this workload is that it should be possible to stop an
instance within a short time (a few minutes) compared to a traditional physics job.

Resource Management In Openstack

Operators use project quotas for ensuring the fair sharing of their infrastructure. The problem with this, is that quotas pose as hard limits.This
leads to actually dedicating resources for workloads even if they are not used
all the time or to situations where resources are not available even though
there is quota still to use.

At the same time, the demand for cloud resources is increasing rapidly. Since
there is no cloud with infinite capabilities, operators need a way to optimize
the resource utilization before proceeding to the expansion of their infrastructure.

Resources in idle state can occur, showing lower cloud utilization than the full
potential of the acquired equipment while the users’ requirements are growing.

The concept of Preemptible Instances can be the solution to this problem. These
type of servers can be spawned on top of the project's quota, making use of the
underutilised  capabilities. When the resources are requested by tasks with
higher priority (such as approved quota), the preemptible instances are
terminated to make space for the new VM.

Preemptible Instances with Openstack

Supporting preemptible instances, would mirror the AWS Spot Market and the
Google Preemptible Instances. There are multiple things to be addressed here as
part of an implementation with OpenStack, but the most important can be reduced to these:
  1. Tagging Servers as Preemptible
In order to be able to distinguish between preemptible and non-preemptible
servers, there is the need to tag the instances at creation time. This property
should be immutable for the lifetime of the servers.
  1. Who gets to use preemptible instances
There is also the need to limit which user/project is allowed to use preemptible
instances. An operator should be able to choose which users are allowed to spawn this type of VMs.
  1. Selecting servers to be terminated
Considering that the preemptible instances can be scattered across the different cells/availability zones/aggregates, there has to be “someone” able to find the existing instances, decide the way to free up the requested resources according to the operator’s needs and, finally, terminate the appropriate VMs.
  1. Quota on top of project’s quota
In order to avoid possible misuse, there could to be a way to control the amount of preemptible resources that each user/project can use. This means that apart from the quota for the standard resource classes, there could be a way to enforce quotas on the preemptible resources too.

OPIE : IFCA and Indigo Dataclouds

In 2014, there were the first investigations into approaches by Alvaro Lopez
from IFCA (https://blueprints.launchpad.net/nova/+spec/preemptible-instances).
As part of the EU Indigo Datacloud project, this led to the development of the
OpenStack Pre-Emptible Instances package (https://github.com/indigo-dc/opie).
This was written up in a paper to Journal of Physics: Conference Series
(http://iopscience.iop.org/article/10.1088/1742-6596/898/9/092010/pdf) and
presented at the OpenStack summit (https://www.youtube.com/watch?v=eo5tQ1s9ZxM)

Prototype Reaper Service

At the OpenStack Forum during a recent OpenStack summit, a detailed discussion took place on how spot instances could be implemented without significant changes to Nova. The ideas were then followed up with the OpenStack Scientific Special Interest Group.

Trying to address the different aspects of the problem, we are currently
prototyping a “Reaper” service. This service acts as an orchestrator for
preemptible instances. It’s sole purpose is to decide the way to free up the
preemptible resources when they are requested for another task.

The reason for implementing this prototype, is mainly to help us identify
possible changes that are needed in Nova codebase to support Preemptible

More on this WIP can be found here: 


The concept of Preemptible Instances gives operators the ability to provide a
more "elastic" capacity. At the same time, it enables the handling of increased
demand for resources, with the same infrastructure, by maximizing the cloud

This type of servers is perfect for tasks/apps that can be terminated at any
time, enabling the users to take advantage of extra cpu power on demand without the fixed limits that quotas enforce.

Finally, here in CERN, there is an ongoing effort to provide a prototype
orchestrator for Preemptible Servers with Openstack, in order to pinpoint the
changes needed in Nova to support this feature optimally. This could also be
available in future for other OpenStack clouds in use by CERN such as the
T-Systems Open Telekom Cloud through the Helix Nebula Open Science Cloud


  • Theodoros Tsioutsias (CERN openlab fellow working on Huawei collaboration)
  • Spyridon Trigazis (CERN)
  • Belmiro Moreira (CERN)


Tuesday, 30 January 2018

Keep calm and reboot: Patching recent exploits in a production cloud

At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources for both the laboratory's physics program but also for the organisation's administrative operations such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

With the accelerator stopping over the CERN annual closure until mid March, this is a good period to be planning reconfiguration of compute resources such as the migration of our central batch system which schedules the jobs across the central compute resources to a new system based on HTCondor. The compute resources are heavily used but there is more flexibility to drain some parts in the quieter periods of the year when there is not 10PB/month coming from the detectors. However, this year we have had an unexpected additional task to deploy the fixes for the Meltdown and Spectre exploits across the centre.

The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based with guests of a variety of operating systems including Windows flavors and CERNVM. The campaign to upgrade involved a number of steps
  • Assess the security risk
  • Evaluate the performance impact
  • Test the upgrade procedure and stability
  • Plan the upgrade campaign
  • Communicate with the users
  • Execute the campaign

Security Risk

The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

Two major risks were identified
  • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment for physicists to log into from around the world, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide 1000s of worker nodes processing the data from CERN experiments by farming event after event to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs and thus were rated as high risk for the Meltdown exploit.
  • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors will have VMs owned by different end users and therefore can be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
The remaining VMs are for dedicated services without access for end user applications or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

There are a variety of different hypervisor configurations which we split down by processor type (in view of the Spectre microcode patches). Each of these needs independent performance and stability checks.

Processor name(s)
E5-2630 v3 @ 2.40GHz,E5-2640 v3 @ 2.60GHz
E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
E5-2650 v2 @ 2.60GHz
CPU family: 21 Model: 1 Model name: AMD Opteron(TM) Processor 6276 Stepping: 2
E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

These risks were explained by the CERN security team to the end users in their regular blogs.

Evaluating the performance impact

The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. These are synthetic programs based on the C++ components of SPEC CPU2006 which match the instruction mix of the typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by by one of the LHC experiments using their real workloads) with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected. 

Test the upgrade procedure and stability

With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre were dependent on the upstream availability of the patches. These could be broken down into several parts
  • Firmware for the processors - the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic as new processor firmware was being added on a regular basis and it was not always clear when this needed to be applied, the package version would increase but it was not always that this included an update for the particular hardware type. Following through the Intel release notes,  there were combinations such as "HSX C0(06-3f-02:6f) 3a->3b" which explains that the processor description 06-3f-02:6f is upgraded from release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo and the firmware level can be found at /sys/devices/system/cpu/cpu0/microcode/version. A simple script (spectre-cpu-microcode-checker.sh) was made available to the end users so they could check their systems and this was also used by the administrators to validate the central IT services.
  • For the operating system, we used a second script (spectre-meltdown-checker.sh) which was derived from the upstream github code at https://github.com/speed47/spectre-meltdown-checker.  The team maintaining this package were very responsive incorporating our patches so that other sites could benefit from the combined analysis.

Communication with the users

For the cloud, there are several resource consumers.
  • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters for online control systems such as those used to manage the accelerator infrastructure.
  • End users consume cloud resources by asking for virtual machines and using them as personal working environments. Typical cases would be a MacOS user who needs a Windows desktop where they would create a Windows VM and use protocols such as RDP to access it when required.
The communication approach was as follows:
  • A meeting was held to discuss the risks of exploits, the status of the operating systems and the plan for deployment across the production facilities. With a Q&A session, the major concerns raised were around potential impact on performance and tuning options. 
  • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
  • CERN management was informed of the risks and the plan for deployment.
CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.

Execute the campaign

With the accelerator starting up again in March and the risk of the exploits, the approach taken was to complete the upgrades to the infrastructure in January, leaving February to find any residual problems and resolve them. As the handling of the compute/batch part of the infrastructure was relatively straight forward (with only one service on top), we will focus in the following on the more delicate part of hypervisors running services supporting several thousand users in their daily work.

The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.

The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or on the application side. The remaining zones were scheduled on consecutive days (GVA-B and GVA-C), the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services required to limit the window of service downtime, their hosting servers were moved to the special Group 1, the only one for which we could give a precise start time.

For each group several steps were performed:
  • install all relevant packages
  • check the next kernel is the desired one
  • reset the BMC (needed for some specific hardware to prevent boot problems)
  • log the nova and ping state of all guests
  • stop all alarming 
  • stop nova
  • shut down all instances via virsh
  • reboot the hosts
  • ... wait ... then fix hosts which did not come back
  • check running kernel and vulnerability status on the rebooted hosts
  • check and fix potential issues with the guests
Shutting down virtual machines via 'virsh', rather than the OpenStack APIs, was chosen to speed up the overall process -- even if this required to switch off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in the context of the question whether guests should be shut down and rebooted (which we did during this campaign) or paused/resumed. This is an option we'll have a look at to prepare for similar campaigns in the future.

As some of the hypervisors in the cloud had very long uptimes and this was the first time we systematically rebooted the whole infrastructure since the service went to full production about five years ago, we were not quite sure what kind issues to expect -- and in particular at which scale. To our relief, the problems encountered on the hosts hit less than 1% of the servers and included (in descending order of appearance)
  • hosts stuck in shutdown (solved by IPMI reset)
  • libvirtd stuck after reboot (solved by another reboot)
  • hosts without network connectivity (solved by another reboot)
  • hosts stuck in grub during boot (solved by reinstalling grub) 
On the guest side, virtual machines were mostly ok when the underlying hypervisor was ok as well.
A few additional cases included
  • incomplete kernel upgrades, so the root partition could not be found (solved by booting back into an older kernel and reinstall the desired kernel)
  • file system issues (solved by running file system repairs)
So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!


While these kind of security issues do not arrive very often, the key parts of the campaign follow standard steps, namely assessing the risk, planning the update, communicating with the user community, execution and handling incomplete updates.

Using cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines and encourages good practise to load balance resources.



  • Arne Wiebalck
  • Jan Van Eldik
  • Tim Bell

Wednesday, 30 August 2017

Scheduled snapshots

While most of the machines on the CERN cloud are configured using Puppet with state stored in external databases or file stores, there are a few machines where this has been difficult, especially for legacy applications.

Doing a regular snapshot of these machines would be a way of protecting against failure scenarios such as hypervisor failure or disk corruptions.

This could always be scripted by the project administrator using the standard functions in the openstack client but this would also involve setting up the schedules and the credentials externally to the cloud along with appropriate skills for the project administrators. Since it is a common request, the CERN cloud investigated how this could be done as part of the standard cloud offering.

The approach that we have taken uses the Mistral project to execute the appropriate workflows at a scheduled time. The CERN cloud is running a mixture of OpenStack Newton and Ocata but we used the Mistral Pike release in order to have the latest set of fixes such as in the cron triggers. With the RDO packages coming out in the same week as the upstream release, this avoided doing an upgrade later.

Mistral has a set of terms which explain the different parts of a workflow (https://docs.openstack.org/mistral/latest/terminology).

The approach needed several steps
  • Mistral tasks to define the steps
  • Mistral workflows to provide the order to perform the steps in
  • Mistral cron triggers to execute the steps on schedule

Mistral Workflows

The Mistral workflows consist of a set of tasks and a process which decides which task to execute next based on different branch criteria such as success of a previous task or the value of some cloud properties.

Workflows can be private to the project, shared or public. By making these scheduled snapshot workflows public, the cloud administrators can improve the tasks incrementally and the cloud projects will receive the latest version of the workflow next time they execute them. With the CERN gitlab based continuous integration environment, the workflows are centrally maintained and then pushed to the cloud when the test suites have completed successfully.

The following Mistral workflows were defined


Virtual machines can be snapshotted so that a copy of the virtual machine is saved and can be used for recovery or cloning in future. The instance_snapshot workflow performs this operation for both virtual machines which have been booted from volume or locally.

The name of the instance to be snapshot
The name of the snapshot to store. The text ={0}= is replaced by the instance name and the text ={1}= is replaced by the date in the format YYYYMMDDHHMM.
The number of snapshots to keep. Older snapshots are cleaned from the store when new ones are created.
0 (i.e. keep all)
Only complete the workflow when the steps have been completed and the snapshot is stored in the image storage
Shut the instance down before snapshotting and boot it up afterwards.
false (i.e. do not stop the instanc)
e-mail address to send the report if the workflow is successful
null (i.e. no mail sent)
e-mail address to send the report if the workflow failed
null (i.e. no mail sent)

The steps for this workflow are described in the detail in the YAML/YAQL files at https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows.

The operation is very fast with Ceph based boot-from-volumes since the snapshot is done within Ceph. It can however take up to a minute for locally booted VMs while the hypervisor is ensuring the complete disk contents are available. The VM is resumed and the locally booted snapshot is then sent to Glance in the background.

The high level steps are

·      Identify server
·      Stop instance if requested by instance_stop
·      If the VM is locally booted
o   Snapshot the instance
o   Clean up the oldest image snapshot if over max_snapshots
·      If the VM is booted from volume
o   Snapshot the volume
o   Cleanup oldest volume snapshot if over max_snapshots
·      Start instance if requested by instance_stop
·      If there is an error and to_addr_error is set
o   Send an e-mail to to_addr_error
·      If there is no error and to_addr_success is set
o   Send an e-mail to to_addr_success

For applications which are not highly available, a common configuration is using a LanDB alias to a particular VM. In the event of a failure, the VM can be cloned from a snapshot and the LanDB alias updated to reflect the new endpoint location for the service. This workflow will create a volume if the source instance is booted from volume. The workflow is called restore_clone_snapshot.

The source instance needs to be still defined since information such as the properties, flavor and availability zone are not included in the snapshot and these are propagated by default.

The name of the instance from which the snapshot will be cloned
The date of the snapshot to clone (either YYYYMMDD or YYYYMMDDHHMM)
The name of the snapshot to clone. The text ={0}= is replaced by the instance name and the text ={1}= is replaced by the date.
The name of the new instance to be created
The availability zone to create the clone in.
Same as the source instance
The flavour for the cloned instance
Same as the source instance
The properties to copy to the new instance
All properties are copied from the source[1]
Only complete the workflow when the steps have been completed and the cloned VM is running
e-mail address to send the report if the workflow is successful
null (i.e. no mail sent)
e-mail address to send the report if the workflow failed
null (i.e. no mail sent)

Thus, cloning the machine timbfvlinux143 to timbfvclone143 requires running the workflow with the parameters

{“instance”: “timbfvlinux143”, “clone_name”: “timbfvclone143”, “date”: “20170830” }

This results in

·      A new volume created from the snapshot timbfvlinux143_snapshot_20170830
·      A new VM is created called timbfvclone143 booted from the new volume

An instance clone can be run for VMs which are booted from volume even when the hypervisor is not running. A machine can then be recovered from it's current state using the procedure

·      Instance snapshot of original machine
·      Instance clone from that snapshot (using today's date)
·      If DNS aliases are used, the alias can then be updated to point to the new instance name

For Linux guests, the rename of the hostname to the clone name occurs as the machine is booted. In the CERN environment, this took a few minutes to create the new virtual machine and then up to 10 minutes to wait for the DNS refresh.

For Windows guests, it may be necessary to refresh the Active Directory information given the change of hostname.

In the event of an issue such as a bad upgrade, the administrator may wish to roll back to the last snapshot. This can be done using the restore_inplace_snapshot workflow.

This operation works for locally booted machines, maintains the IP and MAC address but cannot be used if the hypervisor is down. It does not currently work for boot from volume until the revert to snapshot (available in Pike from https://specs.openstack.org/openstack/cinder-specs/specs/pike/cinder-volume-revert-by-snapshot.html) is in production.

The name of the instance from which the snapshot will be replaced
The date of the snapshot to replace from (either YYYYMMDD or YYYYMMDDHHMM)
The name of the snapshot to replace from. The text ={0}= is replaced by the instance name and the text ={1}= is replaced by the date.
Only complete the workflow when the steps have been completed and the replaced VM is running
e-mail address to send the report if the workflow is successful
null (i.e. no mail sent)
e-mail address to send the report if the workflow failed
null (i.e. no mail sent)

Mistral Cron Triggers
Mistral has another nice feature where it is able to run a workflow at regular intervals. Compared to standard Unix cron, the Mistral cron triggers use Keystone trusts to save the user token when the trigger is enabled. Thus, the execution is able to run without needing the credentials such as a password or valid Kerberos token.
The steps are as follows to create a cron trigger via Horizon or the CLI.
The name of the cron trigger
Nightly Snapshot
Workflow ID
The name or UUID of the workflow
A JSON dictionary of the parameters
{“instance”: “timbfvlinux143”, “max_snapshots”: 5, “to_addr_error”: “theadmin@cern.ch”}
A cron schedule pattern according to http://en.wikipedia.org/wiki/Cron
* 5 * * * (i.e. run daily at 5a.m.)

This will then execute the instance snapshot at 5a.m. sending a mail to theadmin@cern.ch in the event of a failure of the snapshot. 5 past copies will be kept.

Mistral Executions
When Mistral runs a workflow, it provides details of the steps executed, the timestamps for start and end along with the results. Each step can be inspected individually as part of debugging and root cause analysis in the event of failures.
The Horizon interface gives an easy interface for selecting the failing tasks. There may be tasks reported as ‘error’ but these steps can then have subsequent actions which succeed so an error step may be a normal part of a successful task execution such as using a default if no value can be found.


  • Jose Castro Leon from the CERN IT cloud team did the implementation of the Mistral project and the workflows described.

[1] Except for a CERN specific one called landb-alias for a DNS alias