
Sunday, 29 November 2015

Our cloud in Kilo


Following on from previous upgrades, CERN migrated its OpenStack cloud to Kilo between September and November. Along with the bug fixes, we plan to exploit a significant number of new features, especially those related to performance tuning. The overall cloud architecture was covered in a talk at the Tokyo OpenStack summit (video: https://www.openstack.org/summit/tokyo-2015/videos/presentation/unveiling-cern-cloud-architecture).

As the LHC continues to run 24x7, these upgrades were performed while the cloud was running, leaving the virtual machines untouched.

As with the previous upgrades described in earlier posts, a staged approach was used. While most of the steps went smoothly, a few problems were encountered:
  • Cinder - we encountered the bug https://bugs.launchpad.net/cinder/+bug/1455726, which led to a foreign key error. The cause appears to be related to UTF8. The patch (https://review.openstack.org/#/c/183814/) was not completed and so did not get included in the release. More details are in the thread at http://lists.openstack.org/pipermail/openstack/2015-August/013601.html.
  • Keystone - one of the configuration parameters for caches had changed syntax and this was not reflected in the configuration generated by Puppet. The symptoms were high load on the Keystone servers since caching was not enabled.
  • Glance - given the rolling upgrade approach for Glance, we took advantage of having virtualised the majority of the Glance server pool. This allowed new servers to be brought online with the Kilo configuration and the old Juno ones to be deleted.
  • Nova - we upgraded the control plane services along with the QA compute nodes. With the versioned objects, we could stage the migration of the thousands of compute nodes so that we did not need to do all the updates at once. Puppet looked after the appropriate deployments of the RPMs.
    • Following the upgrade, we had an outage of the OpenStack-specific part of the metadata service; the EC2 metadata continued to work fine. This is a cells-related issue and we'll create a bug/blueprint for the fix.
    • The VM resize functions are giving errors during execution. We're tracking this with the upstream developers.
    • We wanted to use the latest Nova NUMA features. We encountered a problem with cells and this feature, although it worked well in a non-cells cloud. This is being tracked in https://bugs.launchpad.net/nova/+bug/1517006. We will use the new features for performance optimisation once these problems are resolved.
    • The dynamic migration of flavors was only partially successful. With the cells database having the flavor data in two places, the migration needed to be done in both simultaneously. We resolved this by forcing the migration of the flavors to the new endpoint.
    • The handling of ephemeral drives in Kilo differs from Juno: the option default_ephemeral_format now defaults to vfat rather than ext3. The aim seems to have been to give vfat to Windows and ext4 to Linux, but our environment does not follow this. This was reported by Nectar but we could not find any migration advice in the Kilo release notes. We have set the default back to ext3 while we work out the migration implications (see the configuration sketch below).
    • We're also working through a scaling problem for our most dynamic cells at https://bugs.launchpad.net/nova/+bug/1524114. Here, all VMs are being queried by the scheduler, not just the active ones. Since we create/delete hundreds of VMs an hour, there are large numbers of deleted VMs, which made one query take longer than expected.
Catching these cases with cells early is part of the scope of the Cells V2 project at https://wiki.openstack.org/wiki/Nova-Cells-v2, to which we are contributing along with the BARC centre in Mumbai, so that the cells configuration becomes the default (with only a single cell) and the upstream test cases are enhanced to validate the multi-cell configuration.
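
For reference, a minimal sketch of the setting applied on our compute nodes to keep the Juno behaviour for ephemeral drives (the long-term value depends on the outcome of the migration investigation):

# /etc/nova/nova.conf on the compute nodes
[DEFAULT]
# keep the previous ext3 default rather than the new Kilo default of vfat
default_ephemeral_format = ext3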

As some of the hypervisors are still running Scientific Linux 6, we used the approach from GoDaddy of packaging the components using software collections. Details are available at https://github.com/krislindgren/openstack-venv-cent6. We used this for the nova and ceilometer agents installed on the hypervisors. The controllers were upgraded to CentOS 7 as part of the upgrade to Kilo.

Overall, getting to Kilo enables new features and includes bug fixes that reduce administration effort. Keeping up with new releases requires careful planning and sharing in upstream activities such as the Puppet modules, but it has proven to be the best approach. With many of the CERN OpenStack team at the summit in Tokyo, we did not complete the upgrade before Liberty was released, but this was completed soon afterwards.

With the Kilo base in production, we are now ready to start work on the Nova network to Neutron migration, deployment of the new EC2 API project and enabling Magnum for container native applications.


Thursday, 8 October 2015

Scheduling and disabling Cells

In order to scale the OpenStack cloud infrastructure at CERN, we were early to embrace an architecture that uses cells. Cells is a Nova feature that allows the partitioning of a cloud infrastructure into smaller groups with independent control planes.

For large deployments, cells have several advantages:

  • a single endpoint for users; 
  • increased availability and resilience of the infrastructure; 
  • prevention of Nova and external components (DBs, message brokers) reaching their limits; 
  • isolation of different use cases.
However, cells also have some limitations. Several Nova features don't work when running cells:
  • Security groups; 
  • Managing aggregates in the top cell; 
  • Availability zone support; 
  • Server groups; 
  • Limited cell scheduler functionality.
There have been many changes since we deployed our initial cells configuration two years ago. During the past months, there has been a lot of work involving cells, especially making sure that they are properly tested and developing Cells V2, which should become the default way to deploy Nova in the future.

However, when using cells today we continue to receive the following welcome message :)

"The cells feature of Nova is considered experimental by the OpenStack project because it receives much less testing than the rest of Nova. This may change in the future, but current deployers should be aware that the use of it in production right now may be risky."

At CERN, we now have 26 child cells supporting 130,000 cores across two data centres in a single cloud. Some cells are dedicated to general use cases and others only to specific projects.

In order to map projects to cells we developed a scheduler filter for the cell scheduler.

https://github.com/cernops/nova/blob/cern-2014.2.2-2/nova/cells/filters/target_cell_project.py

The filter relies on two new values defined in nova.conf: "cells_default" and "cells_projects".

  • "cells_default" contains the set of available cells to schedule instances if the project is not mapped to any specific cell; 
  • "cells_projects" contains the mapping cell -> project for the specific use cases; 

nova.conf 
cells_default=cellA,cellB,cellC,cellD 
cells_projects=cellE:<project_uuid1>;<project_uuid2>,cellF:<project_uuid3> 


For example, when an instance belonging to "project_uuid2" is created, it is scheduled to "cellE". However, if the instance belongs to "project_uuid4", it is scheduled to one of the default cells ("cellA", "cellB", "cellC", "cellD").

One of the problems when using cells is that it is not possible to disable them from the scheduler.

With this scheduler filter we can achieve this. To disable a cell we just need to remove it from the "cells_default" or "cells_projects" list. Disabling a cell means that it will not be possible to create new instances on it; however, it remains available for operations such as restart, resize and delete, as illustrated below.
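
As an illustration (a hypothetical change, reusing the cells from the example configuration above), removing "cellD" from "cells_default" stops new instances from being scheduled to it while its existing VMs remain fully manageable:

nova.conf 
cells_default=cellA,cellB,cellC 
cells_projects=cellE:<project_uuid1>;<project_uuid2>,cellF:<project_uuid3>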


These experiences will be discussed at the upcoming summit in Tokyo in the deep dive into the CERN OpenStack deployment (https://mitakadesignsummit.sched.org/event/f929ea7ee625dadcc16888cb33984dad#.VhbHu3pCrWI), at the Ops meetup (https://etherpad.openstack.org/p/TYO-ops-meetup) and in the Nova design sessions (https://mitakadesignsummit.sched.org/overview/type/nova#.VhbHaXpCrWI).



Saturday, 26 September 2015

EPT, Huge Pages and Benchmarking


Having reported that EPT has a negative influence on the High Energy Physics standard benchmark HepSpec06, we have started deploying the revised setting across the CERN OpenStack cloud. The procedure on each hypervisor (sketched in the commands after this list) is:
  • Setting the flag to off in /etc/modprobe.d/kvm_intel.conf
  • Stopping new VMs from being placed on the hypervisor and waiting for the work on each guest to finish
  • Reloading the module so the new flag takes effect
  • Enabling new work for the hypervisor
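
A sketch of the per-hypervisor commands, assuming the standard nova client and that no qemu-kvm processes remain once the guests have finished their work (the hypervisor name is a placeholder):

$ nova service-disable <hypervisor> nova-compute     # stop placing new VMs here
# ... wait for the running guests to finish their work ...
$ echo "options kvm_intel ept=0" > /etc/modprobe.d/kvm_intel.conf
$ modprobe -r kvm_intel
$ modprobe kvm_intel
$ cat /sys/module/kvm_intel/parameters/ept           # should now report N
$ nova service-enable <hypervisor> nova-compute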
According to the HS06 tests, this should lead to a reasonable performance improvement based on the results of the benchmark and tuning. However, certain users reported significantly worse performance than previously. In particular, some workloads showed significant differences in their before and after characteristics.

Before the change, the workload was primarily CPU bound, spending most of its time in user space. CERN applications have to process significant amounts of data, so it is not always possible to ensure 100% utilisation, but the aim is to provide the workload with user-space CPU.


When EPT was turned off, some hypervisors showed a very different performance profile: a major increase in non-user load and a reduction in the throughput for the experiment workloads. However, this effect was not observed on the servers with AMD processors.


With tools such as perf, we were able to trace the time down to handling the TLB misses. Perf gives

78.75% [kernel] [k] _raw_spin_lock
6.76% [kernel] [k] set_spte
1.97% [kernel] [k] memcmp
0.58% [kernel] [k] vmx_vcpu_run
0.46% [kernel] [k] ksm_do_scan
0.44% [kernel] [k] vcpu_enter_guest

The process behind the _raw_spin_lock is qemu-kvm.

Using systemtap kernel backtraces, we see mostly page faults and spte_* functions (shadow page table updates). Neither of these should be necessary when there is hardware support for address translation, i.e. EPT.

There may be specific application workloads for which the EPT setting is non-optimal; in the worst case, the performance was several times slower. EPT/NPT increases the cost of page table walks when the page is not cached in the TLB. This document shows how processors can speed up page walks (http://www.cs.rochester.edu/~sandhya/csc256/seminars/vm_yuxin_yanwei.pdf), and AMD includes a page walk cache in its processors which speeds up the walking of pages, as described in this paper: http://vglab.cse.iitd.ac.in/~sbansal/csl862-virt/readings/p26-bhargava.pdf.

In other words, EPT slows down the HS06 results when small pages are involved because the HS06 benchmarks miss the TLB a lot. NPT does not slow them down because AMD has a page walk cache to help speed up finding a page when it is not in the TLB. EPT comes good again when we have large pages because it then rarely results in a TLB miss. So, HS06 is probably representative of most of the job types, but there is a small share of jobs which are different and triggered the above-mentioned problem.

However, with EPT on we have a 6% overhead for the benchmark compared to previous runs, as mentioned in the previous blog. To mitigate the EPT overheads, following the comments on the previous blog, we looked into using dedicated huge pages. Our hypervisors run CentOS 7 and thus support both transparent huge pages and static huge pages. Transparent huge pages perform a useful job under normal circumstances but are opportunistic in nature; they are also limited to 2MB and cannot use the 1GB maximum size.

We tried setting the default huge page to 1G using the Grub cmdline configuration.
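
A sketch of the change, assuming the standard grub2 setup on CentOS 7: the huge page options are appended to GRUB_CMDLINE_LINUX in /etc/default/grub, the configuration is regenerated and the hypervisor rebooted.

# appended to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   default_hugepagesz=1G hugepagesz=1G hugepages=55 transparent_hugepage=never
$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ reboot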

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /boot/grub2/grub.cfg | grep hugepage
linux16 /vmlinuz-3.10.0-229.11.1.el7.x86_64 root=UUID=7d5e2f2e-463a-4842-8e11-d6fac3568cf4 ro rd.md.uuid=3ff29900:0eab9bfa:ea2a674d:f8b33550 rd.md.uuid=5789f86e:02137e41:05147621:b634ff66 console=tty0 nodmraid crashkernel=auto crashkernel=auto rd.md.uuid=f6b88a6b:263fd352:c0c2d7e6:2fe442ac vconsole.font=latarcyrheb-sun16 vconsole.keymap=us LANG=en_US.UTF-8 default_hugepagesz=1G hugepagesz=1G hugepages=55 transparent_hugepage=never
$ cat /sys/module/kvm_intel/parameters/ept
Y

It may also be advisable to disable tuned for the moment until the bug #1189868 is resolved.

We also configured the XML manually to include the necessary huge pages. This will be available as a flavor or image option when we upgrade to Kilo in a few weeks.

  <memoryBacking>
        <hugepages>
          <page size="1" unit="G" nodeset="0-1"/>
        </hugepages>
  </memoryBacking>
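
Once on Kilo, the same request should be possible without editing the XML by using a flavor extra spec; a hedged sketch (the flavor name is only an example):

$ nova flavor-key m1.hugepage set hw:mem_page_size=large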

The hypervisor was configured with huge pages enabled. However, we saw a problem with the distribution of huge pages across the NUMA nodes.

$ cat /sys/devices/system/node/node*/meminfo | fgrep Huge
Node 0 AnonHugePages: 311296 kB
Node 0 HugePages_Total: 29
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 4096 kB
Node 1 HugePages_Total: 31
Node 1 HugePages_Free: 2
Node 1 HugePages_Surp: 0

This shows that the pages were not evenly distributed across the NUMA nodes, which would lead to subsequent performance issues. The suspicion is that the Linux boot sequence led to some pages being used, which made it difficult to find contiguous 1GB blocks for the huge pages. This led us to deploy 2MB pages rather than 1GB for the moment; while this may not be the optimum setting, it allows better optimisation than 4K pages and still gives some potential for KSM to benefit. These changes had a positive effect, with the monitoring showing a reduction in system time.
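
Unlike 1GB pages, 2MB pages can be allocated at runtime. A sketch of the approach, where the page count is illustrative (roughly the 55GB reserved above expressed as 2MB pages):

$ echo 'vm.nr_hugepages = 28160' > /etc/sysctl.d/80-hugepages.conf
$ sysctl -p /etc/sysctl.d/80-hugepages.conf
$ cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages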




At the OpenStack summit in Tokyo, we'll be having a session on Hypervisor Tuning so people are welcome to bring their experiences along and share the various options. Details of the session will appear at https://etherpad.openstack.org/p/TYO-ops-meetup.

Contributions from Ulrich Schwickerath and Arne Wiebalck (CERN) and Sean Crosby (University of Melbourne) have been included in this article along with the help of the LHC experiments to validate the configuration.



Wednesday, 5 August 2015

Tuning hypervisors for High Throughput Computing

Over the past set of blogs, we've looked at a number of different options for tuning High Energy Physics workloads in a KVM environment such as the CERN OpenStack cloud.

This is a summary of the findings using the HEPSpec 06 benchmark on KVM and a comparison with Hyper-V for the same workload.

For KVM on this workload, we saw a degradation in performance on large VMs.


Results for other applications may vary, so each option should be verified for the target environment. The percentages from our optimisations are not necessarily additive but give an indication of the performance improvements to be expected. After applying the following improvements, we saw around 5% overhead compared to bare metal.

Option           Improvement   Comments
CPU topology     ~0            The primary focus for this function was not performance, so the result is as expected
Host Model       4.1-5.2%      Some impact on operations such as live migration
Turn EPT off     6%            Open bug report for CentOS 7 guest on CentOS 7 hypervisor
Turn KSM off     0.9%          May lead to an increase in memory usage
NUMA in guest    ~9%           Needs Kilo or later to generate this with OpenStack
CPU Pinning      ~3%           Needs Kilo or later (cumulative on top of NUMA)

Different applications will see a different range of improvements (or even that some of these options degrade performance). Experiences from other workload tuning would be welcome.

One of the things that led us to focus on KVM tuning was the comparison with Hyper-V. At CERN, we made an early decision to run a multi-hypervisor cloud building on the work by cloudbase.it and Puppet on Windows to share the deployment scripts for both CentOS and Windows hypervisors. This allows us to direct appropriate workloads to the best hypervisor for the job.

When we saw a significant overhead with the default KVM configuration, one of the tests was to compare the performance overhead of a Linux configuration on Hyper-V. Interestingly, Hyper-V achieved better performance without tuning than the KVM configurations. Equivalent tests on Hyper-V showed:
  • 4 VMs 8 cores: 0.8% overhead compared to bare metal 
  • 1 VM 32 cores: 3.3% overhead compared to bare metal
These performance results showed that we needed to tune the hypervisor rather than there being a fundamental problem with virtualisation, and allowed us to focus on the potential areas for optimisation (with the results above for NUMA and CPU pinning).

The Hyper-V configuration pins each core to the underlying NUMA socket, which is similar to how the Kilo NUMA tuning sets up KVM.




This gives the following Linux guest configuration, as seen from a guest running on a Hyper-V hypervisor:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 28999 MB
node 0 free: 27902 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 29000 MB
node 1 free: 28027 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks to the QEMU discuss mailing list and to the other team members who helped understand the issue: Sean Crosby (University of Melbourne) and Arne Wiebalck, Sebastian Bukowiec and Ulrich Schwickerath (CERN).




Monday, 3 August 2015

NUMA and CPU Pinning in High Throughput Computing


CERN's OpenStack cloud runs the Juno release on mainly CentOS 7 hypervisors.
Along with previous tuning options described in this blog which can be used on Juno, a number of further improvements have been delivered in Kilo.

Since this release will be installed at CERN during the autumn, we had to configure standalone KVM configurations to test the latest features, in particular around NUMA and CPU pinning.

NUMA features have been appearing in more recent processors, which means that memory accesses are no longer uniform. Rather than a single large pool of memory accessed by all processors, the performance of a memory access varies according to whether the memory is local to the processor.


A typical case is where VM 1 is running on CPU 1 and needs a page of memory to be allocated. It is important that the memory allocated by the underlying hypervisor is the fastest possible for VM 1 to access in future. Thus, the guest VM kernel needs to be aware of the underlying memory architecture of the hypervisor.

The NUMA configuration of a machine can be checked using lscpu. This shows two NUMA nodes on CERN's standard server configurations (two processors with 8 physical cores and SMT enabled)

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2257.632
BogoMIPS:              5206.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Thus, cores 0-7 and 16-23 are attached to the first NUMA node, with the others on the second; the two ranges come from SMT. VMs, however, see a single NUMA node:

NUMA node0 CPU(s): 0-31


First Approach - numad

The VMs on the CERN cloud come in a variety of sizes, so NUMA has a correspondingly varied influence.



Linux provides the numad daemon, which performs some automated balancing of NUMA workloads by moving memory near to the processor where the thread is running.

In the case of 8-core VMs, numad on the hypervisor provided a performance gain of 1.6%. However, the effect for larger VMs was much less significant: looking at the performance of running 4x8-core VMs versus 1x32-core VM, there was significantly more overhead in the large VM case.




Second approach - expose NUMA to guest VM

This can be done using the appropriate KVM directives. With OpenStack Kilo, these will be configurable via flavor extra specs and image properties. In the meantime, we configured the hypervisor with the following XML for libvirt.

<cpu mode='host-passthrough'>
<numa>
<cell id='0' cpus='0-7' memory='16777216'/>
<cell id='1' cpus='16-23' memory='16777216'/>
<cell id='2' cpus='8-15' memory='16777216'/>
<cell id='3' cpus='24-31' memory='16777216'/>
</numa>
</cpu>

In an ideal world, there would be two cells defined (0-7,16-23 and 8-15,24-31) but KVM currently does not support non-contiguous ranges on CentOS 7 [1]. The guests see the configuration as follows

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 2593.750
BogoMIPS: 5187.50
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31

With this approach and turning off numad on the hypervisor, the performance of the large VM improved by 9%.
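
When we are on Kilo, the equivalent guest NUMA topology should be requestable through a flavor extra spec rather than hand-edited XML; a hedged sketch (the flavor name is only an example):

$ nova flavor-key m1.xlarge set hw:numa_nodes=2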

We also investigated the numatune options but these did not produce a significant improvement.

Third Approach - Pinning CPUs

From the hypervisor's perspective, the virtual machine appears as a single process which needs to be scheduled on the available CPUs. While the NUMA configuration above means that memory access from the processor will tend to be local, the hypervisor may still choose to run the VM's next time slice on a different processor. While this is useful in the case of hypervisor over-commit, for a CPU-bound application it leads to less memory locality.

With Kilo, it will be possible to pin a virtual core to a physical one. For now, the same was done using the hypervisor XML, as for NUMA.

<cputune>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="2"/>
<vcpupin vcpu="3" cpuset="3"/>
<vcpupin vcpu="4" cpuset="4"/>
<vcpupin vcpu="5" cpuset="5"/>
...

This means that virtual core #1 always runs on physical core #1.
Repeating the large VM test provided a further 3% performance improvement.
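
With Kilo, the corresponding pinning request should be expressible as a flavor extra spec; again a hedged sketch with an example flavor name:

$ nova flavor-key m1.xlarge set hw:cpu_policy=dedicated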

The exact topology has been set in a simple fashion. Further investigation into getting exact mappings between thread siblings is needed to get the most out of the tuning.

The impact on smaller VMs (8 and 16 cores) also needs to be studied. Optimising for one use case has the risk that other scenarios may be affected, and custom configurations for particular VM topologies increase the operational effort of running a cloud at scale. While the changes should be positive, or at minimum neutral, this needs to be verified.

Summary

Exposing the NUMA nodes and using CPU pinning has reduced the large VM overhead with KVM from 12.9% to 3.5%. When the features are available in OpenStack Kilo, these can be deployed by setting up the appropriate flavors with the additional pinning and NUMA descriptions for the different hardware types so that large VMs can be run at a much lower overhead.

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck and Ulrich Schwickerath (CERN).


Updates

[1] RHEV does support this with the later QEMU rather than the default in CentOS 7 (http://cbs.centos.org/repos/virt7-kvm-common-testing/x86_64/os/Packages/, version 2.1.2)


Sunday, 2 August 2015

EPT and KSM for High Throughput Computing

As part of the analysis of CERN's compute intensive workload in a virtualised infrastructure, we have been examining various settings of KVM to tune the performance.

EPT

EPT (Extended Page Tables) is an Intel technology which provides hardware assistance for virtualisation and is one of the options of the Intel KVM driver in Linux. It is turned on by default but can be controlled using the driver's module options; the driver can then be reloaded as long as no qemu-kvm processes are running.

# cat /etc/modprobe.d/kvm_intel.conf
options kvm_intel ept=0
# modprobe -r kvm_intel
# modprobe kvm_intel

In past studies, EPT has had a negative performance impact on High Energy Physics applications. With recent changes in processor architecture, this was re-tested.


The result is a 6% performance improvement with EPT off. This seems surprising, as the feature is intended to improve virtualisation performance rather than reduce it.

The CERN configuration uses hypervisors running CentOS 7 and guests running Scientific Linux CERN 6. With this configuration, EPT can be turned off without problems, but a recent test with CentOS 7 guests has shown that this has an issue, which has been reported upstream: only one CPU is recognised and the rest are reported as being unresponsive.

KSM

Kernel same-page merging (KSM) is a technology which finds common memory pages inside a Linux system and merges them so that only a single copy is kept, saving memory. If one of the copies is updated, a new copy is created, so the function is transparent to the processes on the system.

For hypervisors, this can be very beneficial when multiple guests are running the same level of operating system. However, there is an overhead due to the scanning process which may cause the applications to run more slowly.

We benchmarked 4 VMs, each with 8 cores, running the same operating system levels. The result was that KSM causes an overhead of around 1%.



To turn KSM off, the ksmtuned daemon should be stopped.

systemctl disable ksmtuned
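
A slightly fuller sketch, assuming the CentOS 7 defaults where ksm and ksmtuned are separate services; the final command asks the kernel to stop scanning and unmerge the pages that are already shared:

$ systemctl stop ksmtuned ksm
$ systemctl disable ksmtuned ksm
$ echo 2 > /sys/kernel/mm/ksm/run    # 0 = stop, 1 = run, 2 = stop and unmerge shared pages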

The ksmd kernel thread still seems to run but does not use any CPU resources. Following the change, it is important to verify that there is still sufficient memory on the hypervisor, since not merging the pages could cause an increase in memory usage and lead to swapping (which has a very significant performance impact).

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck and Ulrich Schwickerath  (CERN).


References

  • Intel article on EPT - https://01.org/blogs/tlcounts/2014/virtualization-advances-performance-efficiency-and-data-protection
  • Previous studies with KVM and HEP code - https://indico.cern.ch/event/35523/session/28/contribution/246/attachments/705127/968004/HEP_Specific_Benchmarks_of_Virtual_Machines_on_multi-core_CPU_Architectures.pdf
  • VMWare paper at https://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf





CPU Model Selection for High Throughput Computing

As part of the work to tune the configuration of the CERN cloud, we have been exploring various options for tuning compute intensive workloads.

One option in the Nova configuration allows the CPU model visible in the guest to be selected from several alternatives.

The choices are as follows (the corresponding nova.conf setting is sketched after this list):
  • host passthrough provides an exact view of the underlying processor
  • host model provides a view of a processor model which is close to the underlying processor but gives the same view for several processors, e.g. a range of different frequencies within the same processor family
  • custom allows the administrator to provide a view selecting the exact characteristics of the processor
  • none gives the hypervisor default configuration
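
The selection is made on the hypervisor in nova.conf; a minimal sketch for the libvirt driver as we understand it in Juno/Kilo:

# /etc/nova/nova.conf
[libvirt]
cpu_mode = host-passthrough      # or host-model, custom, none
# cpu_model = ...                # only consulted when cpu_mode = custom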
There are a number of factors to consider for this selection:
  • Migration between hypervisors has to be done with the same processor in the guest. Thus, if host passthrough is configured and the VM is migrated to a new generation of servers with a different processor, this operation will fail.
  • Performance will vary with host passthrough being the fastest as the application can use the full feature set of the processor. The extended instructions available will vary as shown at the end of this article where different settings give different flags.
The exact performance impact will vary according to the application. High Energy Physics uses the benchmark suite HEPSpec06, which is a subset of the SPEC 2006 benchmarks. Using this benchmark, we observed around a 4% reduction in performance for CPU-bound applications with host model; moving to the default gave an overhead of 5%.


Given the significant differences, the CERN cloud is configured such that
  • hypervisors running compute intensive workloads are configured for maximum performance (passthrough). These workloads are generally easy to re-create so there is no need for migration between hypervisors (such as warranty replacement) but instead new instances can be created on the new hardware and the old instances deleted
  • hypervisors running services are configured with host model so that they can be migrated between generations of equipment and between hypervisors if required such as for an intervention
In the future, we would be interested in making this setting an option at VM creation, such as metadata on the nova boot command or a specific property on an image, so that end users could choose the appropriate option for their workloads.

host-passthrough

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping        : 4
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

host-model

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel Xeon E312xx (Sandy Bridge)
stepping        : 1
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

none

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : QEMU Virtual CPU version 1.5.3
stepping        : 3
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good unfair_spinlock pni cx16 hypervisor lahf_lm
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Previous blogs in this series are
  • CPU topology - http://openstack-in-production.blogspot.fr/2015/08/openstack-cpu-topology-for-high.html

Contributions from Ulrich Schwickerath and Arne Wiebalck have been included in this article.

Saturday, 1 August 2015

OpenStack CPU topology for High Throughput Computing

We are starting to look at the latest features of OpenStack Juno and Kilo as part of the CERN OpenStack cloud to optimise a number of different compute intensive applications.

We'll break down the tips and techniques into a series of small blogs. A corresponding set of changes to the upstream documentation will also be made to ensure the options are documented fully.

In the modern CPU world, a server consists of multiple levels of processing units.
  • Sockets, where each of the processor chips is inserted
  • Cores, where each processor contains multiple processing units which can run processes in parallel
  • Threads, where settings such as SMT (if enabled) may allow multiple processing threads to be active at the expense of sharing a core
The typical hardware used at CERN is a 2-socket system. This provides the optimum price/performance for our typical high throughput applications, which simulate and process events from the Large Hadron Collider. The aim is not to process a single event as quickly as possible but rather to process the maximum number of events within a given time (within the total computing budget available). As the price of processors varies according to performance, the selected systems are often not the fastest possible but the ones which give the best performance/CHF.

A typical example of this approach is in our use of SMT which leads to a 20% increase in total throughput although each individual thread runs correspondingly slower. Thus, the typical configuration is

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2999.953
BogoMIPS:              5192.93
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31


By default in OpenStack, the virtual CPUs in a guest are allocated as standalone processors. This means that for a 32 vCPU VM, it will appear as

  • 32 sockets
  • 1 core per socket
  • 1 thread per socket
As part of ongoing performance investigations, we wondered about the impact of this topology on CPU bound applications.

With OpenStack Juno, there is a mechanism to pass the desired topology. This can be done through flavors or image properties.

The names are slightly different between the two usages: flavors use properties which start with hw: and images use properties starting with hw_.

The flavor configurations are set by the cloud administrators and the image properties can be set by the project members. The cloud administrator can also set maximum values (e.g. hw_max_cpu_cores) so that the project members cannot define values which are incompatible with the underlying resources.


$ openstack image set --property hw_cpu_cores=8 --property hw_cpu_threads=2 --property hw_cpu_sockets=2 0215d732-7da9-444e-a7b5-798d38c769b5
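
The equivalent via a flavor extra spec, set by the cloud administrator, would look something like this (the flavor name is only an example):

$ nova flavor-key m2.large set hw:cpu_sockets=2 hw:cpu_cores=8 hw:cpu_threads=2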

The VM which is booted then has the requested topology reflected:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2593.748
BogoMIPS:              5187.49
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K

NUMA node0 CPU(s):     0-31

While this gives the possibility to construct interesting topologies, the performance benefits are not clear. The standard High Energy Physics benchmark shows no significant change. Given that there is no direct mapping between the cores in the VM and the underlying physical ones, this may be because the cores are not pinned to the corresponding sockets/cores/threads, and thus Linux may be optimising for a virtual configuration rather than the real one.

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck (CERN).

The following documentation reports have been raised
  • Flavors Extra Specs -  https://bugs.launchpad.net/openstack-manuals/+bug/1479270
  • Image Properties - https://bugs.launchpad.net/openstack-manuals/+bug/1480519
