LHC Tunnel

Monday, 9 December 2013

Swiss and Rhone Alpes User Group Meeting at CERN

A combined meeting of the Swiss OpenStack and Rhone Alpes OpenStack user groups was held at CERN on Friday 6th December 2013. This was the 6th Swiss User Group meeting and the 2nd Rhone Alpes one.

Despite the cold and a number of long distance travelers, 85 people from banking, telecommunications, academia/research and IT companies gathered to share OpenStack experiences across the region and get the latest news.

The day was split into two parts. In the morning, there was the opportunity to visit the ATLAS experiment, see a 3-D movie on how the experiment was built and visit the CERN public exhibition points at the Globe and Microcosm. Since the Large Hadron Collider is currently undergoing maintenance until 2015, the experimental areas are accessible for small groups with a guide.

At the ATLAS control room, you could see a model of the detector.


The real thing is slightly bigger and heavier... 100m underground, only one end is accessible currently from the viewing gallery.

The afternoon session presentations are available at the conference page.

After a quick introduction, I gave feedback on the latest board meeting in Hong Kong with topics such as the election process and the defcore discussion to answer the "What is OpenStack" question.

The following talk was a set of technical summaries of the Hong Kong summit from Belmiro, Patrick and Gergely. Belmiro covered the latest news on Nova, Glance and Cinder along with some slides from his deep dive talk on CERN's OpenStack cloud.


Patrick from Bull's xlcloud covered the latest news on Heat which is rapidly becoming a key part of the OpenStack environment as it not only provides a user facing service for orchestration but also is now a pre-requisite for other projects such as Trove, the database as a service.


In a good illustration of the difficulties of compatibility, the open office document failed to display the key slide on PowerPoint but Patrick covered the details while the PDF version was brought up. Heat currently supports AWS CloudFormation templates but is now adding a native template language, HOT, to cover additional functions. The Icehouse release will add more auto scaling features, integration with Ceilometer and a move towards the TOSCA standard.

Gergely covered some of the user stories and the latest news on Ceilometer as it starts to move into alarming on top of the existing metering function.


Alessandro then covered the online clouds at CERN which are opportunistically using the 1000s of servers attached to the CMS and ATLAS experiments when the LHC is not running. The aim is to be able to switch as fast as possible from the farm being used to filter the 1PB/s from the LHC to performing physics work. Current tests show it takes around 45 minutes to instantiate the VMs on the 1,400 hypervisors.


Jens-Christian gave a talk on the use of Ceph at SWITCH. Many of the aspects seem similar to the block storage that we are looking at within CERN's cloud. SWITCH are aiming for a 1,000 core cluster to serve the Swiss academic community including dropbox, IaaS and app-store style services. It was particularly encouraging to see that SWITCH have been able to perform online upgrades between versions without problems. The regular warning to be cautious with CephFS was also made, so using rbd rather than filesystem backed storage makes sense.


Martin gave us a detailed view on the options for geographically distributed clouds using OpenStack. This was intriguing on multiple levels in view of CERN's ongoing work with the community on federated identity, and included some useful hints and tips on the different kinds of approaches. Martin converged onto using Regions to achieve the ultimate goals but there were several potentially useful intermediate configurations such as cells, which CERN is using extensively in the multi-data centre cloud with Budapest and Meyrin. I fully agree with Martin's perspective on the need for cells to become more than just a Nova concept as we require similar functions in Glance and Cinder for the CERN use case. Martin had given the same talk in Paris on Thursday and was giving it again on Monday in Israel so he is doing a fine job in re-use of presentations.


Sergio described Elastic Cluster which is an open source tool for provisioning clusters for researchers. He illustrated the function with a YouTube video which demonstrates the scaling of the cluster on top of a cloud infrastructure.


Finally, Dave Neary gave an introduction to OpenShift and how to deploy a Platform-as-a-Service solution. Using a simple git/ssh model, various PaaS instances can be deployed easily on top of an OpenStack cloud. The talk included a memorable demo where Dave showed the depth of his Ruby skills but was rescued by the members of the audience and the application deployed to rounds of applause.


Many thanks to the CERN guides and administration for helping with the organisation, to all of the attendees for coming and making it such a lively meeting and to Sven Michels and Belmiro Moreira for the photos.

Thursday, 17 October 2013

Log handling and dashboards in the CERN cloud

At CERN, we developed a fabric monitoring tool called Lemon around 10 years ago while the computing infrastructure for the Large Hadron Collider was being commissioned.

As in many areas, modern Open Source tools are now providing equivalent function without the effort of maintaining the entire software stack and benefiting from the experience of others.

As part of the ongoing work to align the CERN computing fabric management to open source tools with strong community support, we have investigated a monitoring tool chain which would allow us to consolidate logs, perform analysis for trends and provide dashboards to assist in problem determination.

Given our scalability objectives, we can expect O(100K) endpoints with multiple data sources to be correlated from PDUs, machines and guest instances.

A team looking into a CERN IT reference architecture have divided the problem into 4 areas

  • Transport - how to get the data from the client machine to where it can be analysed, with the aim of a low overhead per client. Flume was chosen as the implementation for this part after a comparison with ActiveMQ (which we use in other areas such as monitoring notifications), as there were many pre-existing connectors for sources and sinks.
  • Tagging - where we can associate additional external metadata with the data so each entry is self describing. Elastic Search was the selection here.
  • Dashboard - where queries on criteria are displayed in an intuitive way (such as time series or pie charts) on a service specific dashboard. Kibana was the natural choice in view of the ease of integration with Elastic Search.
  • Long term data repository to allow offline analysis and trending when new areas of investigation are needed. HDFS was the clear choice for unstructured data and we had a number of other projects at CERN that require its function too.
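
To make the tagging step more concrete, here is a minimal Python sketch of what turning a raw log line into a self-describing entry could look like. It is an illustration only and not the code we run: the lookup table, the host names and the field names (`service`, `datacentre`, `@timestamp`) are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone

# Hypothetical lookup table standing in for an external metadata source
# (for example a hardware or service inventory). All names are illustrative.
HOST_METADATA = {
    "hv001": {"service": "nova-compute", "datacentre": "Meyrin"},
    "hv002": {"service": "nova-compute", "datacentre": "Budapest"},
}


def tag_log_entry(host, message, timestamp, metadata=HOST_METADATA):
    """Return a self-describing record: the raw line plus external metadata."""
    record = {
        "host": host,
        "message": message,
        "@timestamp": timestamp.isoformat(),
    }
    # Attach whatever is known about the host; unknown hosts are still kept.
    record.update(metadata.get(host, {"service": "unknown"}))
    return record


entry = tag_log_entry(
    "hv002",
    "Instance spawned successfully",
    datetime(2013, 10, 17, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(entry, sort_keys=True))
```

Because every entry carries its own context, a dashboard query can filter by service or data centre without joining against another source at query time.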

Using this architecture, we have implemented the CERN private cloud log handling, taking the logs from the 700 hypervisors and controllers around the CERN cloud and consolidating them into a set of dashboards.

Our requirements are
  • Have a centralized copy of our logs to ease problem investigation
  • Display OpenStack usage statistics such as management dashboards
  • Show the results of functional tests and probes to the production cloud
  • Maintain a long term history of the infrastructure status and evolution
  • Monitor the state of our databases
The Elastic Search configuration is a 14 node cluster with 11 data nodes and 3 HTTP nodes configured using Puppet and running, naturally, on VMs in the CERN private cloud. We have 5 shards per index with 2 replicas per shard.
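
As a quick sanity check on what that layout means, the arithmetic can be sketched in a few lines of Python. The dictionary uses the standard index setting names `number_of_shards` and `number_of_replicas`; the helper function and the even-spread average are simplifications for illustration, not a description of how the cluster actually places shards.

```python
def shard_copies(primaries, replicas_per_shard):
    """Total shard copies for one index: each primary plus its replicas."""
    return primaries * (1 + replicas_per_shard)


# Figures quoted in the text.
DATA_NODES = 11
index_settings = {"number_of_shards": 5, "number_of_replicas": 2}

copies = shard_copies(
    index_settings["number_of_shards"],
    index_settings["number_of_replicas"],
)
# Simplification: assume copies are spread evenly over the data nodes.
average_per_node = copies / DATA_NODES

print(f"{copies} shard copies per index")
print(f"about {average_per_node:.2f} copies per data node per index")
```

With two replicas, every shard exists three times, so a shard stays available as long as at least one of the nodes holding a copy survives.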

Kibana is running on 3 nodes, with Shibboleth authentication to integrate to the CERN Single Sign On system.

A number of dashboards have been created. A Nova API dashboard shows the usage.

An Active User dashboard helps us to identify if there are heavy users causing disturbance.

The dashboards themselves can be created dynamically without needing an in-depth knowledge of the monitoring infrastructure.

This work has been implemented by the CERN monitoring and cloud teams, with thanks for the support of the Flume, Elastic Search and Kibana communities.

Saturday, 21 September 2013

A tale of 3 OpenStack clouds : 50,000 cores in production at CERN

I had several questions at the recent GigaOm conference in London regarding the OpenStack environments at CERN. This blog explains the different instances and the distinct teams who evaluate and administer them.

The CERN IT department provides services for around 11,000 physicists from around the world. Our central services provide the basic IT infrastructure that any organisation needs from e-mail, web services, databases and desktop support.

In addition, we provide computing resources for the physics analysis of the 35PB/year that comes from the Large Hadron Collider. The collider is currently being upgraded until 2015 in order to double the energy of the beams and enhance the detectors which are situated around the ring.

During this period, we have a window for significant change to the central IT services. While the analysis of the data from the first run of the LHC continues during this upgrade, the computing infrastructure in CERN IT is moving towards a cloud model based on popular open source technologies used outside of CERN such as Puppet and OpenStack.

As we started working with these communities through mailing lists and conferences, we encountered other people from High Energy Physics organisations around the world who were going through the same transition, such as IN2P3, Nectar, IHEP and Brookhaven. What was most surprising was finding out that others on the CERN site were working on OpenStack and Puppet and that the CERN IT cloud was actually the smallest one!

The two largest experiments, ATLAS and CMS, both have large scale farms close to the detectors to filter the 1PB/s of data from each detector before sending it to the CERN computer centre for recording and analysis.
These High Level Trigger farms are over 1000 servers, each typically 12 core systems. During the upgrade of the LHC, these servers would be idle as there are no collisions. However, the servers are attached to CERN's technical network, which is isolated from the rest of CERN's network as this network is used for systems which are closely associated with the accelerator and other online systems. Thus, they could not easily be used for running physics programs since the data is not accessible from the technical network.

In view of this, the approach taken was to use a cloud with the virtual machines not being able to access the technical network. This allows strong isolation and makes it very easy to start/stop large numbers of programs at a time. During CMS's tests, they were starting/stopping 250VMs in a 5 minute period.

For the software selection, each experiment makes its own choices independent of the CERN IT department. However, both teams selected OpenStack as their cloud software. The ATLAS cloud was set up with the help of Brookhaven National Laboratories who are also running OpenStack in their centre. The CMS cloud was set up and run by two CMS engineers.

For configuration management, ATLAS online teams were already using Puppet and CMS migrated from Quattor to Puppet during the course of the project. This allowed them to use the Stackforge Puppet modules, as we do in the CERN IT department.

Both the experiment clouds are now in production, running physics simulation and some data analysis programs that can fit within the constraints of limited local storage and network I/O to the computer centre.

Thus, the current CERN cloud capacities are as follows.

Cloud                              Hypervisors   Cores
ATLAS cloud                        -             28,800 (HT)
CMS OOOO cloud                     -             -
CERN IT Ibex and Grizzly clouds    -             20,952 (HT)

This makes over 60,000 cores in total managed by OpenStack and Puppet at CERN. With hyper-threaded cores on both AMD and Intel CPUs, it is always difficult to judge exactly how much extra performance is achieved, thus my estimate of an effective delivery of 50,000 cores in total.
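
For readers who want to check the numbers, the arithmetic can be written out as a small Python sketch. It uses only the two core counts that appear in the table plus the two totals quoted in the text; the variable names are mine, and the "implied remainder" is simply what is left over, not a published figure.

```python
# The two core counts that appear in the table above (both hyper-threaded).
listed_ht_cores = [28_800, 20_952]

QUOTED_TOTAL = 60_000        # "over 60,000 cores in total"
EFFECTIVE_ESTIMATE = 50_000  # the estimate of effective delivery

listed_sum = sum(listed_ht_cores)
# Lower bound on what the remaining cloud must contribute to exceed the total.
implied_remainder = QUOTED_TOTAL - listed_sum
# Fraction of the nominal core count that the estimate treats as effective.
effective_ratio = EFFECTIVE_ESTIMATE / QUOTED_TOTAL

print(f"listed clouds: {listed_sum} cores")
print(f"remaining cloud: more than {implied_remainder} cores")
print(f"effective fraction: {effective_ratio:.0%}")
```

In other words, the estimate discounts the nominal count by roughly a sixth to allow for the uncertain benefit of hyper-threading.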

While the CERN IT cloud is currently only 20,000 cores, we are installing around 100 hypervisors a week in the two data centres in Geneva and Budapest, so we would expect the cores in this area to grow significantly in the next 18 months as we are aiming for 300,000 cores in 2015.

Wednesday, 14 August 2013

Managing identities in the cloud

CERN has 11,000 physicists who use the lab's facilities including the central IT department resources. As with any research environment, there are many students, PhDs and other project members who join one of the experiments at CERN. They need to have computing accounts to access CERN's cloud but we also need to make sure these resources are handled correctly when they are no longer affiliated with the organisation.

Managing Users

For the CERN OpenStack cloud, we wanted complete integration with the site identity management system. With around 200 arrivals/departures per month, managing identities within OpenStack would have been a major effort.

CERN's users are stored in our Active Directory system which provides a single central password and user attribute store such as full name, organisational unit and location. We also define our user groups using Active Directory so that lists of members of an experiment can be centrally managed and applications share this master source of data for allocating roles to user groups.

Keystone provides the OpenStack authentication service including an LDAP back end. Working with the community during the Folsom release, we developed a number of patches so that Keystone was able to use the LDAP interface to Active Directory (see http://docs.openstack.org/trunk/openstack-compute/admin/content/configuring-keystone-for-ldap-backend.html for details). This allows users from both the command line and Horizon GUI to use OpenStack with their standard credentials.

Where we can, we leave the LDAP schema read-only since there are many other dependencies and major schema changes can cause significant disruption.

Multiple Identities

Historically, users have multiple accounts at CERN.
  • Primary account which is used for their prime activity
  • Secondary accounts are used for cases where a different identity is needed. Typical examples are where an administrator needs an account with standard user rights for documentation, or an ultra account which is rarely used
  • Service accounts which are shared. Here the user is responsible for the account but is able to transfer the account to another user. Typical examples would be an account used for running a daemon or an application internal resource.
As examples, I have timbell (my primary account for my day to day work), timothybell (my secondary account, used to simulate a typical low privilege user profile for documentation) and owncloud (a service account related to a specific application).

The structure of the cloud identities is such that we are aiming to use primary accounts and to use roles within projects to reduce the need for secondary accounts. A project with multiple members to manage it covers the service account scenario with respect to resources.

Thus, the cloud can potentially simplify both identity/roles and authentication by focusing on the one user, one account model. We expect exceptions but since one of the aims of the move to the cloud was to simplify our environment, we hope these can be limited to very special circumstances.

Managing Roles

We use the standard conventions for OpenStack roles.
  • Admin is a global role providing 'super-user' access to OpenStack. This is allocated to a group within Active Directory and the only members are the staff who support the cloud within IT.
  • For each project, there is a members list defined. When a project is set up, a group is provided as part of the request which defines the people who are able to perform actions within the project such as VM creation/deletion/reboot.
There is a regular script which ensures that the Active Directory groups are synchronised with those in Keystone.
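
The core of such a synchronisation is a comparison of two membership lists. The following Python sketch shows the idea under simplifying assumptions: both sides are plain lists of account names, and the function name and sample users are invented for the example. Our real script talks to Active Directory and Keystone, which is not shown here.

```python
def plan_sync(directory_members, keystone_members):
    """Compare two membership lists and return (to_add, to_remove).

    The directory is treated as the master source, so anyone missing from
    Keystone is added and anyone no longer in the directory is removed.
    """
    directory = set(directory_members)
    keystone = set(keystone_members)
    return sorted(directory - keystone), sorted(keystone - directory)


# Invented sample data: group members on each side before the sync.
ad_group = ["alice", "bob", "carol"]
keystone_project = ["bob", "carol", "dave"]

to_add, to_remove = plan_sync(ad_group, keystone_project)
print("grant role to:", to_add)
print("revoke role from:", to_remove)
```

Computing the plan separately from applying it makes it easy to log or review the changes before they touch Keystone.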

User Lifecycle

With over 200 arrivals and departures every month, it is important to track the owner of resources to retire them when someone is no longer working on a CERN related activity. 

We use Microsoft Forefront Identity Manager (FIM) as an engine to automatically create users when someone is registered in the CERN Human Resources database and to expire them as they leave.

Users who wish to use the cloud can subscribe via the CERN accounts and resources portal. This creates an account and a personal project for them in a few minutes so they can already start investigating cloud technologies.

The general approach is that personal resources (such as the Personal project in OpenStack) will be removed. VMs will be stopped and deleted. Departing users are removed from their roles. Ownership of shared resources, such as projects, can be transferred before leaving or is automatically passed to the supervisor.

With this lifecycle, the OpenStack resources follow the same rules as other computing resources and there are no orphaned resources.
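
The departure rules described above can be summarised as a small planning function. This Python sketch is only a model of the policy: the `Project` class, the action names and the sample accounts are hypothetical, and it produces a list of intended actions rather than calling any OpenStack or FIM interface.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    owner: str
    personal: bool
    vms: list = field(default_factory=list)
    members: set = field(default_factory=set)


def plan_departure(user, supervisor, projects):
    """List the clean-up actions for a departing user (nothing is executed)."""
    actions = []
    for project in projects:
        if project.personal and project.owner == user:
            # Personal resources are removed: VMs first, then the project.
            for vm in project.vms:
                actions.append(("stop_and_delete_vm", project.name, vm))
            actions.append(("delete_project", project.name))
            continue
        if user in project.members:
            actions.append(("remove_roles", project.name, user))
        if project.owner == user:
            # Shared resources that were not handed over go to the supervisor.
            actions.append(("transfer_ownership", project.name, supervisor))
    return actions


projects = [
    Project("Personal jdoe", "jdoe", True, vms=["vm1"]),
    Project("ATLAS Simulation", "jdoe", False, members={"jdoe", "asmith"}),
]
for action in plan_departure("jdoe", "supervisor1", projects):
    print(action)
```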

To allow FIM and OpenStack to integrate, we developed a service called Cornerstone which provides a SOAP interface for FIM such as create personal project, create shared project, etc. and then performs the automated operations behind the scenes.

One interesting issue was the propagation delays. When a new project is created in FIM, Active Directory is updated but there is a small delay before all the slaves of Active Directory are updated. Thus, for project creation, we use a single Active Directory server to receive the information to avoid inconsistency (at the expense of availability if AD is down). 


As we've rolled out Grizzly, there is now ongoing work on the CERN Grizzly OpenStack to enhance user access. Specifically,
  • Kerberos and X.509 certificates for user authentication are widely used in High Energy Physics work. Kerberos is often used for interactive user authentication. X.509 certificates are also used for users but increasingly as a way to identify services such as automated job submission factories. Now that Keystone supports REMOTE_USER authentication, we can use the Apache Kerberos and certificate authentication methods to front end the Keystone service. This will avoid having to source a profile and enter passwords.
  • Integration of CERN's web based Single Sign On is an attractive option for Horizon. While common passwords are used, the user of Horizon still needs to enter their password to get access to the dashboard. CERN uses Microsoft ADFS to provide a Single Sign On capability which is used for most web applications.
  • We have a team of system administrators who perform the standard operations tasks when there are alarms in our monitoring system. These sysadmins need to be able to start/stop/reboot instances across the cloud but not perform create/delete/... operations. We will investigate how to model this within the existing JSON policy files.
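
To illustrate the kind of rule the last point asks for, here is a toy Python model of the access check. It is deliberately not the OpenStack policy engine and does not use its rule syntax: the role names (`member`, `operator`), the action names and the function are all hypothetical and only show the intended behaviour.

```python
# Toy model only: role and action names are hypothetical.
MEMBER_ACTIONS = {"create", "delete", "start", "stop", "reboot"}
OPERATOR_ACTIONS = {"start", "stop", "reboot"}


def is_allowed(user_roles, action, user_project, target_project):
    """Decide whether a user may perform an action on an instance."""
    if "operator" in user_roles and action in OPERATOR_ACTIONS:
        # Operators act across the whole cloud, but with a limited action set.
        return True
    if "member" in user_roles and user_project == target_project:
        # Project members have the full set, but only inside their project.
        return action in MEMBER_ACTIONS
    return False


print(is_allowed({"operator"}, "reboot", "IT Ops", "ATLAS Simulation"))  # True
print(is_allowed({"operator"}, "delete", "IT Ops", "ATLAS Simulation"))  # False
print(is_allowed({"member"}, "delete", "ATLAS Simulation", "ATLAS Simulation"))  # True
```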
A number of ongoing activities in Havana will make further integration easier:
  • The Keystone V3 API is coming along which will include additional functionality in the area of mapping groups to roles. We will investigate how to map OpenStack roles into Active Directory groups and thus avoid synchronisation scripts.
  • Domains will add an extra level of project handling allowing us to group projects together. This will also create the possibilities of a structured set of roles within our user communities.
We'll be participating in the Havana design discussions around these areas so that we can further streamline our user and identity management in future.

Wednesday, 7 August 2013

Flavors - An English perspective

OpenStack has the capability to define flavors of virtual machines, which specify how many cores and how much swap, disk and memory an instance receives.

As a native UK English speaker, I find the term flavor to already be a problem. My spell checker fixes it to Flavour and requires regular manual changes. I appeal to the OpenStack technical committee to not accept any term which is not the same in US and UK English or to allow an alias :-)

At CERN, we make available 4 standard flavors modeled on the Amazon ones. These common names are already familiar to public cloud users and allow some better compatibility with scripts using EC2.

[Table: the four standard flavors, including a System Disk column]

For most cases, this set allows us to cover the configurations physicists ask for. There are some inefficiencies which can occur if an app requires 8GB but not much CPU power, or needs an 80GB disk but not much memory. These can be addressed by some overcommitting.

Currently, we overcommit on CPU and also use SMT. The current configuration of hypervisors is 24 cores (i.e. 48 cores with SMT enabled), 96GB memory and three 2TB disks. This matches the configuration of the above flavors for memory and CPU and produces a configuration which is around twice the per-core performance on Amazon when we compare using benchmarks such as HEPSpec2006, which is a subset of the SPEC benchmarks using C++.

Past experience with virtualisation has made us cautious about overcommitting on memory. A hypervisor that starts swapping can cause a significant impact on all the VMs. As we gain more operational experience, we may start memory overcommit but we need to establish a baseline performance first.

However, the disk configuration is increasingly becoming a problem. Since our hypervisors run a variety of workloads, we need to mirror the disks to ensure a reasonable reliability and also to avoid the operational work of having to re-install and for the users to re-create VMs after every disk failure. We use Linux software RAID on the KVM hypervisors running on Scientific Linux 6 (a derivative of RHEL). We have experimented with different combinations for the 3rd disk between making it a spare or a 3rd mirror. Currently, we are running in a 3-way mirror as we found some Linux stability issues on RAID-1 with spare.

The hardware itself was purchased for running bare-metal classic High Throughput Computing batch services. Typical configurations are based around Supermicro Quad systems assembled by European resellers. With these configurations, you would run a single instance of Linux on bare metal and have a batch scheduler (CERN uses LSF) to run the varied workload with fair share between the users.

However, when we use a similar configuration for hypervisors, some interesting effects emerge.
  • Space becomes more limited. Having some space in /var for logs and crash dumps is standard for a Linux host but when we add glance image caches and the backing store for the VMs along with mirroring the disks, it starts to get tight on the hypervisors with 2TB of space.
  • We could potentially be running 48 m1.tiny configurations which would require 960GB of disk space to support their VMs. Operations like suspend to disk become operationally difficult.
  • With only 3 spindles, we are limited for IOPS. The impact of this is reduced since much of the High Energy Physics code is CPU bound or directly accessing storage over the network using protocols such as HTTP or root (a specific protocol developed for accessing HEP data sets).
  • I/O patterns emerge according to the standard Linux schedules. Typical cases of Linux scheduling are yum updates and updatedb runs for the locate command which use the cron.daily schedules at 4am. Suddenly, we have 48 VMs all running updatedb and yum update at exactly the same moment with 3 disks.
We get requests for special flavors with more than 4 cores, very large memory or multi-terabyte large system disks. Analysing these cases, there are a number of motivations.
  • Some applications are still scale-up rather than scale-out. For most of these cases, we're suggesting that people delay moving to the CERN private cloud for a few months as we are running at basic service levels (equivalent to Amazon) and the scale up applications tend to be server consolidation rather than cloud applications. With the improvements coming in Havana and as we bring Cinder external block storage online, more of these use cases can be considered.
  • In other cases, there is a scalability issue with the application itself. As we add more nodes, distributed systems often encounter contention bottlenecks and show non-linear performance changes. Creating a new VM for each single core batch job would create a significant increase in the number of nodes compared to today's load of around 7,000 physical servers.
  • Large external storage is a common request. These applications such as Cassandra or MongoDB are the cases we classify as 'Hippos'. Using the Pets/Cattle analogy from Microsoft/Cloudscaling, we have a cattle server which is redundant but with a large volume of disk storage. These are best served with a Cinder-like solution rather than by creating a system disk of multi-terabyte size. We've been evaluating NetApp, Gluster and Ceph storage solutions in this area and plan to bring a production service online later in the year.
  • While we currently do not use live migration, we will be using it more as we increase the service levels to cover some of the server consolidation use cases in specialised zones within the CERN cloud. Experience with our previous service consolidation environment has shown that large memory servers are a major difficulty, for example when transferring 64GB or more of virtual machine memory in a consistent and transactional mode. Some VMs change their memory faster than we can transfer it between the old and new hypervisors. Thus, we limit our out-of-the-box flavors to 8GB and review other cases to understand the application needs further.
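
The live migration problem in the last point can be illustrated with a simplified model of iterative pre-copy, where each round re-sends the memory that was modified while the previous round was being copied. The Python sketch below is an idealised simulation: the bandwidth, dirty rate, threshold and round limit are invented numbers, and real hypervisors behave in more complicated ways.

```python
def precopy_rounds(memory_gb, bandwidth_gb_s, dirty_rate_gb_s,
                   stop_threshold_gb=0.1, max_rounds=30):
    """Simulate iterative pre-copy migration.

    Returns the number of copy rounds needed before the remaining dirty
    memory is small enough for the final switch-over, or None if the
    copy never catches up within max_rounds.
    """
    remaining = memory_gb
    for round_number in range(1, max_rounds + 1):
        transfer_time = remaining / bandwidth_gb_s
        # Memory dirtied while this round was being copied must be re-sent.
        remaining = dirty_rate_gb_s * transfer_time
        if remaining <= stop_threshold_gb:
            return round_number
    return None


# Invented figures: a 64GB VM over a link moving 1 GB/s.
print(precopy_rounds(64, bandwidth_gb_s=1.0, dirty_rate_gb_s=0.5))  # converges
print(precopy_rounds(64, bandwidth_gb_s=1.0, dirty_rate_gb_s=1.2))  # None
```

In this model the copy only converges when memory is dirtied more slowly than it can be transferred, and the amount of data to move grows with the size of the VM, which is why large memory flavors are reviewed case by case.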
While these are reasonable operational restrictions, one of the challenges of the private cloud is how to handle exceptions. Within a private cloud model, assuming no cross charging but a quota model based on pledges, there is a need to reflect the cost of unusual configurations. A 64GB, 8 core VM with 1TB of system disk would be very difficult to pack with a combination of 2GB/1core/20GB VMs. This leads to inefficiencies in resource utilisation for CPU, memory or disk.
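
The packing problem is easy to see with a little arithmetic. This Python sketch uses the hypervisor figures quoted earlier (48 cores with SMT, 96GB of memory) and the two VM shapes from the paragraph above. It assumes no overcommit and 2TB of usable disk after mirroring, with 1TB counted as 1000GB; the class and function names are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cores: int
    memory_gb: int
    disk_gb: int


def max_instances(free, flavor):
    """How many VMs of this flavor fit, limited by the scarcest resource."""
    return min(
        free.cores // flavor.cores,
        free.memory_gb // flavor.memory_gb,
        free.disk_gb // flavor.disk_gb,
    )


def subtract(free, flavor, count=1):
    """Resources left after placing `count` VMs of the given flavor."""
    return Resources(
        free.cores - count * flavor.cores,
        free.memory_gb - count * flavor.memory_gb,
        free.disk_gb - count * flavor.disk_gb,
    )


# Assumptions: no overcommit, 2TB usable disk after mirroring.
HYPERVISOR = Resources(cores=48, memory_gb=96, disk_gb=2000)
SMALL = Resources(cores=1, memory_gb=2, disk_gb=20)
LARGE = Resources(cores=8, memory_gb=64, disk_gb=1000)

only_small = max_instances(HYPERVISOR, SMALL)
after_large = subtract(HYPERVISOR, LARGE)
small_beside_large = max_instances(after_large, SMALL)
leftover = subtract(after_large, SMALL, small_beside_large)

print(f"small VMs on an empty hypervisor: {only_small}")
print(f"small VMs next to one large VM: {small_beside_large}")
print(f"stranded after packing: {leftover}")
```

Under these assumptions, a single large VM leaves room for only 16 small ones before memory runs out, stranding 24 cores and 680GB of disk that no standard flavor can use.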

As we look out to the future, there are a number of positive developments for addressing the flavor sprawl.
  • External block storage functionality with Cinder will allow us to cover the large disk storage 'Hippo' use case. An m1.medium can ask for an external volume and use that for database storage.
  • We are investigating Linux KSM options to cover scenarios where multiple identical Linux images are running on a single hypervisor. Under these scenarios, KSM would share the code pages providing significant optimisations for the small VM packing scenarios.
  • Future procurement rounds will be looking at configurations specifically for hypervisors rather than the current approach of re-cycling existing servers which had been purchased for other application profiles. A wide range of options from SSDs for higher IOPS, more disks or even further exploitation of external storage are being investigated.
Overall, the cloud model provides huge flexibility for our users to ask for the configurations they need. In the past, a custom configuration would take many months to deliver (using public procurement models of market surveys, specifications, tendering, adjudication, ordering, installation and burn-in). Physicists can now ask for a new VM and get it within the time to get a coffee. 

While many flavors can provide flexibility, we should not lose sight of the need to maximise efficiency and make sure that CPU, memory and disk are all used at a higher level in the cloud than previously with bare metal dedicated resources.

Upcoming work in this area is to
  • Investigate a current limitation in the experimental cells functionality within OpenStack that we use to achieve scalability. Flavor requests are not passed to child cells and thus adding new flavors is a manual process. We will be working with the community to address this restriction.
  • Exploit underlying virtualisation and block storage solutions to provide standard flavors with the flexibility for additional services which could cover their requirement. Cinder with many back end drivers is one of our top priority areas to deploy.

Thursday, 1 August 2013

The First Week - Projects

The First Week - Hot Topics - Projects!

At CERN, we've recently gone live with our OpenStack based cloud for 11,000 physicists around the world.

The major efforts prior to going live were to perform the integration into the CERN identity management system, define projects and roles and configure the components in a high availability set up using the Puppetlabs tools. Much of this work was generic and patches have been submitted back to the community.

Surprisingly, the major topics on our cloud go-live day were not how to implement applications using cloud technologies, how accounting is performed or support levels for non-standard images.

Instead, the support lines were hot with two topics... flavors and projects! I'll cover flavors in a subsequent posting.

For projects, each user signing up to the CERN private cloud is given a personal quota of 10 cores so they can work in a sandbox to understand how to use clouds. The vast majority of resources will be allocated to shared projects where multiple users collaborate together to simulate physics and analyse the results from the Large Hadron Collider to match with the theory.

Users want a descriptive name for their project. Our standard request form asks
  • What is the name of your project ?
  • How many cores/MB memory/GB disk/etc would you like ?
  • Who is the owner for the project ? 
  • Who are the administrators ?
These do not seem to be Bridge of Death questions but actually require much reflection.

From the user's perspective, the name should be simple such as 'Simulation'. From the cloud provider's perspective, we have multiple user groups to support so the name needs to be unique and clear.
With OpenStack, there are upcoming concepts such as domains which will allow us to group a set of projects together and we wish to prepare the ground for those. So, there is a need for fine grain project definition awaiting future methods to group these projects together.

In the end, we settled on
  • Projects start with the LHC experiment such as ATLAS or CMS. The case reflects whether they are acronyms or not (ATLAS stands for A Toroidal LHC Apparatus, CMS for Compact Muon Solenoid which is ironic for something weighing over 12,000 tonnes)
  • Many requests asked for _ or - in the names. We prefer spaces. There will be bugs (we've already fixed one in the dashboard) but this is desirable in the long term.
  • For projects run by the IT department, we use our service catalog based on the ITIL methodology so that each functional element can request a project.
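
As an illustration of the first two conventions, a request form could run a check along these lines. This Python sketch is hypothetical: the function, the messages and the short list of experiments (only the two named above) are assumptions, and projects run by the IT department would need a separate rule based on the service catalog.

```python
import re

# Only the two experiments named in the text; a real list would be longer.
EXPERIMENTS = ("ATLAS", "CMS")


def check_project_name(name):
    """Return a list of problems with a proposed experiment project name."""
    problems = []
    first_word = name.split(" ")[0]
    if first_word not in EXPERIMENTS:
        problems.append("should start with the experiment name, e.g. 'ATLAS'")
    if re.search(r"[_-]", name):
        problems.append("use spaces rather than '_' or '-'")
    return problems


for candidate in ("ATLAS Simulation", "atlas_simulation", "CMS-Analysis"):
    print(candidate, "->", check_project_name(candidate) or "ok")
```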
Keeping projects small and focused has a number of benefits
  • The accounting can be more precise as the association between a project and a service is clearer.
  • The list of members for a project can be kept small. With the Grizzly based OpenStack cloud, the policies allow members to restart other VMs in the project. This allows sharing of administration responsibilities but can cause issues if multiple teams are members of a single project.
The disadvantages are
  • Additional administration to set up the projects
  • Dependencies on the upcoming domain features in Keystone which should be arriving in the Havana release
  • Quota management is more effort for the support teams. With large projects, the central team can allocate out a large block of quota to a project and leave the project team to look after the resources.
Ultimately, there is a need for an accounting structure of projects and domains so we can track who is using what. Getting the domain/project structure right at the start feeds into the quota management and accounting model in the future.

The good news on projects and quotas is that Havana has some interesting improvements which will reduce the support load.