Wednesday, 7 August 2013

Flavors - An English perspective

OpenStack lets you define flavors of virtual machines, each specifying the number of cores, swap, disk and memory.

As a native UK English speaker, I find the term flavor to already be a problem. My spell checker fixes it to Flavour and requires regular manual changes. I appeal to the OpenStack technical committee to not accept any term which is not the same in US and UK English or to allow an alias :-)

At CERN, we make available four standard flavors modeled on the Amazon ones. These common names are already familiar to public cloud users and allow better compatibility with scripts using EC2.

Flavor       vCores   Memory   System Disk
m1.tiny      1        0.5GB    20GB
m1.small     1        2GB      20GB
m1.medium    2        4GB      40GB
m1.large     4        8GB      80GB

For most cases, this set allows us to cover the configurations physicists ask for. There are some inefficiencies which can occur if an app requires 8GB but not much CPU power, or needs an 80GB disk but not much memory. These can be addressed by some overcommitting.

Currently, we overcommit on CPU and also use SMT. The current configuration of hypervisors is 24 cores (i.e. 48 with SMT enabled), 96GB memory and three 2TB disks. This matches the configuration of the above flavors for memory and CPU and produces a configuration with around twice the per-core performance of Amazon when we compare using benchmarks such as HEPSpec2006, which is a subset of the SPEC benchmarks using C++.
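
To see how the flavors line up with this hardware, here is a minimal sketch of the arithmetic. It assumes one vCore per SMT thread and no memory overcommit; the class and function names are illustrative only and not part of any OpenStack API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    name: str
    vcores: int
    memory_gb: float


# Flavors as listed in the table above.
FLAVORS = [
    Flavor("m1.tiny", 1, 0.5),
    Flavor("m1.small", 1, 2),
    Flavor("m1.medium", 2, 4),
    Flavor("m1.large", 4, 8),
]

# Hypervisor figures from the text: 48 SMT threads, 96GB memory.
# Assumption: one vCore per SMT thread and no memory overcommit.
THREADS = 48
MEMORY_GB = 96


def max_instances(flavor):
    """Return (limit by CPU, limit by memory) for one hypervisor."""
    by_cpu = THREADS // flavor.vcores
    by_memory = int(MEMORY_GB // flavor.memory_gb)
    return by_cpu, by_memory


for flavor in FLAVORS:
    by_cpu, by_memory = max_instances(flavor)
    print(f"{flavor.name:10} cpu limit {by_cpu:3}  memory limit {by_memory:3}")
```

Under these assumptions, m1.small, m1.medium and m1.large run out of CPU and memory at the same point, while m1.tiny runs out of threads first and leaves memory idle.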

Past experience with virtualisation has made us cautious about overcommitting memory. A hypervisor that starts swapping can have a significant impact on all its VMs. As we gain more operational experience, we may start memory overcommit but we need to establish a baseline performance first.

However, the disk configuration is increasingly becoming a problem. Since our hypervisors run a variety of workloads, we need to mirror the disks to ensure reasonable reliability and to avoid the operational work of re-installing hypervisors and having users re-create VMs after every disk failure. We use Linux software RAID on the KVM hypervisors running on Scientific Linux 6 (a derivative of RHEL). We have experimented with using the third disk either as a spare or as a third mirror. Currently, we are running a 3-way mirror as we found some Linux stability issues on RAID-1 with a spare.

The hardware itself was purchased for running bare-metal classic High Throughput Computing batch services. Typical configurations are based around Supermicro Quad systems assembled by European resellers. With these configurations, you would run a single instance of Linux on bare metal and have a batch scheduler (CERN uses LSF) to run the varied workload with fair share between the users.

However, when we use a similar configuration for hypervisors, some interesting effects emerge.
  • Space becomes more limited. Having some space in /var for logs and crash dumps is standard for a Linux host but when we add glance image caches and the backing store for the VMs along with mirroring the disks, it starts to get tight on the hypervisors with 2TB of space.
  • We could potentially be running 48 m1.tiny configurations which would require 960GB of disk space to support their VMs. Operations like suspend to disk become operationally difficult.
  • With only 3 spindles, we are limited for IOPS. The impact of this is reduced since much of the High Energy Physics code is CPU bound or directly accessing storage over the network using protocols such as HTTP or root (a specific protocol developed for accessing HEP data sets).
  • I/O patterns emerge according to the standard Linux schedules. Typical cases are yum updates and updatedb runs for the locate command, which use the cron.daily schedule at 4am. Suddenly, we have 48 VMs all running updatedb and yum update at exactly the same moment on 3 disks.
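The 960GB figure in the list above can be checked, and extended to the other flavors, with a short calculation. This is a sketch that assumes a hypervisor is filled with a single flavor at one vCore per SMT thread; the function name is illustrative.

```python
# (vCores, system disk in GB) per flavor, from the flavor table.
FLAVORS = {
    "m1.tiny": (1, 20),
    "m1.small": (1, 20),
    "m1.medium": (2, 40),
    "m1.large": (4, 80),
}

THREADS = 48  # SMT threads per hypervisor, from the text


def system_disk_demand_gb(vcores, disk_gb, threads=THREADS):
    """Disk needed when a hypervisor is filled with one flavor.

    Assumption: one vCore per SMT thread, no CPU overcommit.
    """
    return (threads // vcores) * disk_gb


for name, (vcores, disk_gb) in FLAVORS.items():
    print(name, system_disk_demand_gb(vcores, disk_gb), "GB")
```

Because system disk scales with vCores in this flavor set, every flavor comes to the same 960GB per fully packed hypervisor, before counting logs, crash dumps or the glance image cache.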
We get requests for special flavors with more than 4 cores, very large memory or multi-terabyte system disks. Analysing these cases, we see a number of motivations.
  • Some applications are still scale-up rather than scale-out. In most of these cases, we're suggesting that people delay moving to the CERN private cloud for a few months as we are running at basic service levels (equivalent to Amazon) and the scale-up applications tend to be server consolidation rather than cloud applications. With the improvements coming in Havana and as we bring Cinder external block storage online, more of these use cases can be considered.
  • In other cases, there is a scalability issue with the application itself. As we add more nodes, distributed systems often run into contention bottlenecks and show non-linear performance changes. Creating a new VM for each single-core batch job would mean a significant increase in node count compared to today's load of around 7,000 physical servers.
  • Large external storage is a common request. Applications such as Cassandra or MongoDB are the cases we classify as 'Hippos'. Using the Pets/Cattle analogy from Microsoft/Cloudscaling, we have a cattle server which is redundant but with a large volume of disk storage. These are best served with a Cinder-like solution rather than by creating a system disk of multi-terabyte size. We've been evaluating NetApp, Gluster and Ceph storage solutions in this area and plan to bring a production service online later in the year.
  • While we currently do not use live migration, we will be using it more as we increase the service levels to cover some of the server consolidation use cases in specialised zones within the CERN cloud. Experience with our previous service consolidation environment has shown that large memory servers are a major difficulty, for example when transferring 64GB or more of virtual machine memory in a consistent and transactional manner. Some VMs change their memory faster than we can transfer it between the old and new hypervisors. Thus, we limit our out-of-the-box flavors to 8GB and review other cases to understand the application needs further.
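The live migration difficulty in the last point can be illustrated with a simplified pre-copy model: each round re-sends the memory dirtied during the previous round, so the copy only converges when the VM dirties memory more slowly than the network can move it. This is a rough sketch, not how any particular hypervisor implements migration; the bandwidth, dirty rate and threshold values are assumptions chosen for illustration.

```python
def precopy_rounds(memory_gb, bandwidth_gb_s, dirty_rate_gb_s,
                   stop_threshold_gb=0.1, max_rounds=30):
    """Count copy rounds until the remaining dirty memory is small.

    Returns the number of rounds needed, or None if the model does not
    converge within max_rounds. All parameters are hypothetical inputs.
    """
    remaining = memory_gb
    for round_no in range(1, max_rounds + 1):
        transfer_time = remaining / bandwidth_gb_s
        # Memory dirtied while this round was being copied, capped at
        # the total size of the VM.
        remaining = min(memory_gb, dirty_rate_gb_s * transfer_time)
        if remaining <= stop_threshold_gb:
            return round_no
    return None


# A 64GB VM over an assumed 1 GB/s link.
print(precopy_rounds(64, 1.0, 0.2))  # dirty rate below bandwidth
print(precopy_rounds(64, 1.0, 1.5))  # dirty rate above bandwidth
```

In this model the first call converges after a handful of rounds, while the second never does, which is the situation described above for VMs that change memory faster than it can be transferred.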
While these are reasonable operational restrictions, one of the challenges of the private cloud is how to handle exceptions. Within a private cloud model, assuming no cross charging but a quota model based on pledges, there is a need to reflect the cost of unusual configurations. A 64GB, 8 core VM with 1TB of system disk would be very difficult to pack with a combination of 2GB/1core/20GB VMs. This leads to inefficiencies in resource utilisation for CPU, memory or disk.
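
The packing problem can be made concrete with a small calculation. This sketch places the 64GB/8 core/1TB VM on one hypervisor and fills the rest with 2GB/1core/20GB VMs. The 48 threads and 96GB come from the hypervisor description earlier; the 2000GB of usable disk is an assumption (a 3-way mirror of 2TB disks, ignoring logs and image caches), and the function is illustrative rather than a real scheduler.

```python
# Hypervisor capacity. Threads and memory are from the text; the usable
# disk figure is an assumption (see the paragraph above).
CAPACITY = {"vcores": 48, "memory_gb": 96, "disk_gb": 2000}

LARGE_VM = {"vcores": 8, "memory_gb": 64, "disk_gb": 1000}
SMALL_VM = {"vcores": 1, "memory_gb": 2, "disk_gb": 20}


def pack(capacity, first, filler):
    """Place one `first` VM, then as many `filler` VMs as fit.

    Returns (number of filler VMs, resources left unused).
    """
    free = {k: capacity[k] - first[k] for k in capacity}
    count = min(free[k] // filler[k] for k in capacity)
    leftover = {k: free[k] - count * filler[k] for k in capacity}
    return count, leftover


count, leftover = pack(CAPACITY, LARGE_VM, SMALL_VM)
print(count, leftover)
```

Under these assumptions only 16 small VMs fit alongside the large one before memory runs out, leaving 24 threads and 680GB of disk unused.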

As we look out to the future, there are a number of positive developments for addressing the flavor sprawl.
  • External block storage functionality with Cinder will allow us to cover the large disk storage 'Hippo' use case. An m1.medium can ask for an external volume and use that for database storage.
  • We are investigating Linux KSM options to cover scenarios where multiple identical Linux images are running on a single hypervisor. Under these scenarios, KSM would share the code pages, providing significant optimisations for the small VM packing scenarios.
  • Future procurement rounds will be looking at configurations specifically for hypervisors rather than the current approach of recycling existing servers which had been purchased for other application profiles. A wide range of options is being investigated, from SSDs for higher IOPS and more disks to further exploitation of external storage.
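For the KSM point above, one way to gauge the benefit on a test hypervisor is to read the counters the Linux kernel exposes under /sys/kernel/mm/ksm. The following is a minimal sketch, assuming KSM is enabled on the host; the helper names are illustrative, and the estimate treats each entry in pages_sharing as one page copy avoided.

```python
import os
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")  # standard sysfs location on Linux


def read_counter(name, ksm_dir=KSM_DIR):
    """Read one integer KSM counter, e.g. 'pages_sharing'."""
    return int((Path(ksm_dir) / name).read_text().strip())


def saved_bytes(pages_sharing, page_size):
    """Approximate memory saved: each sharing site avoids one page copy."""
    return pages_sharing * page_size


# Only report when run directly on a host that exposes the KSM counters.
if __name__ == "__main__" and (KSM_DIR / "pages_sharing").exists():
    page_size = os.sysconf("SC_PAGE_SIZE")
    sharing = read_counter("pages_sharing")
    saved_mib = saved_bytes(sharing, page_size) / 2**20
    print(f"KSM saves roughly {saved_mib:.1f} MiB")
```

Comparing this figure before and after packing a hypervisor with identical small VMs would give a first indication of how much memory KSM recovers.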
Overall, the cloud model provides huge flexibility for our users to ask for the configurations they need. In the past, a custom configuration would take many months to deliver (using public procurement models of market surveys, specifications, tendering, adjudication, ordering, installation and burn-in). Physicists can now ask for a new VM and get it in the time it takes to get a coffee.

While many flavors can provide flexibility, we should not lose sight of the need to maximise efficiency and make sure that CPU, memory and disk are all used at a higher level in the cloud than previously with bare metal dedicated resources.

Upcoming work in this area is to:
  • Investigate a current limitation in the experimental cells functionality within OpenStack that we use to achieve scalability. Flavor requests are not passed to child cells and thus adding new flavors is a manual process. We will be working with the community to address this restriction.
  • Exploit underlying virtualisation and block storage solutions to provide standard flavors, with the flexibility of additional services to cover users' further requirements. Cinder, with its many back-end drivers, is one of our top priority areas to deploy.