Wednesday 5 August 2015

Tuning hypervisors for High Throughput Computing

Over the past set of blogs, we've looked at a number of different options for tuning High Energy Physics workloads in a KVM environment such as the CERN OpenStack cloud.

This is a summary of the findings using the HEPSpec 06 benchmark on KVM and a comparison with Hyper-V for the same workload.

For KVM on this workload, we saw a significant degradation in performance on large VMs with the default configuration (around 12.9% overhead for a 32 core VM before tuning).


Results for other applications may vary, so each option should be verified for the target environment. The percentages from our optimisations are not necessarily additive but give an indication of the performance improvements to be expected. After tuning with the following options, we saw around 5% overhead compared to bare metal.

Option        | Improvement | Comments
CPU topology  | ~0          | The primary focus of this feature was not performance, so this result is as expected
Host Model    | 4.1-5.2%    | Some impact on operations such as live migration
Turn EPT off  | 6%          | Open bug report for CentOS 7 guests on a CentOS 7 hypervisor
Turn KSM off  | 0.9%        | May lead to an increase in memory usage
NUMA in guest | ~9%         | Needs Kilo or later to generate this with OpenStack
CPU Pinning   | ~3%         | Needs Kilo or later (cumulative on top of NUMA)

Different applications will see a different range of improvements (or may even find that some of these options degrade performance). Experiences from other workload tuning would be welcome.

One of the things that led us to focus on KVM tuning was the comparison with Hyper-V. At CERN, we made an early decision to run a multi-hypervisor cloud building on the work by cloudbase.it and Puppet on Windows to share the deployment scripts for both CentOS and Windows hypervisors. This allows us to direct appropriate workloads to the best hypervisor for the job.

When we saw a significant overhead with the default KVM configuration, one of the tests was to measure the overhead of an equivalent Linux configuration on Hyper-V. Interestingly, Hyper-V achieved better performance without tuning than the untuned KVM configurations. Equivalent tests on Hyper-V showed
  • 4 VMs of 8 cores: 0.8% overhead compared to bare metal 
  • 1 VM of 32 cores: 3.3% overhead compared to bare metal
These results told us that the overhead was not a fundamental problem with virtualisation but something that could be addressed by tuning the hypervisor, and they allowed us to focus on the potential areas for optimisation (leading to the NUMA and CPU pinning results above).

The Hyper-V configuration pins each core to the underlying NUMA socket, which is similar to how the Kilo NUMA tuning sets up KVM.


This gives the following configuration, as seen from a Linux guest running on a Hyper-V hypervisor:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 28999 MB
node 0 free: 27902 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 29000 MB
node 1 free: 28027 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks to the QEMU discuss mailing list and to the other team members who helped to understand the issue: Sean Crosby (University of Melbourne), and Arne Wiebalck, Sebastian Bukowiec and Ulrich Schwickerath (CERN).


Monday 3 August 2015

NUMA and CPU Pinning in High Throughput Computing


CERN's OpenStack cloud runs the Juno release on mainly CentOS 7 hypervisors.
Along with the previous tuning options described in this blog, which can be used on Juno, a number of further improvements have been delivered in Kilo.

Since this release will only be installed at CERN during the autumn, we set up standalone KVM configurations to test the latest features, in particular around NUMA and CPU pinning.

NUMA (non-uniform memory access) features have been appearing in more recent processors, which means that memory accesses are no longer uniform. Rather than a single large pool of memory accessed equally by all processors, the performance of a memory access varies according to whether the memory is local to the processor.


A typical case is where VM 1 is running on CPU 1 and needs a page of memory to be allocated. It is important that the memory allocated by the underlying hypervisor is the memory with the fastest access for VM 1 in future, i.e. memory local to CPU 1. Thus, the guest VM kernel needs to be aware of the underlying memory architecture of the hypervisor.

The NUMA configuration of a machine can be checked using lscpu. This shows two NUMA nodes on CERN's standard server configuration (two processors, each with 8 physical cores, and SMT enabled):

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2257.632
BogoMIPS:              5206.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Thus, cores 0-7 and 16-23 are attached to the first NUMA node, with the others on the second. The two ranges per node come from SMT (16-23 being the hyperthread siblings of cores 0-7). A VM, however, sees only a single NUMA node by default:

NUMA node0 CPU(s): 0-31


First Approach - numad

The VMs on the CERN cloud come in a variety of sizes, so the influence of NUMA varies correspondingly.



Linux provides the numad daemon, which performs some automatic balancing of NUMA workloads by moving memory close to the processor where the corresponding thread is running.
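
As an illustration (not part of the original post), enabling numad on a CentOS 7 hypervisor is a matter of installing the package and starting its service:

yum install numad
systemctl enable numad
systemctl start numad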

In the case of 8 core VMs, numad on the hypervisor provided a performance gain of 1.6%. However, the effect for larger VMs was much less significant. Comparing the performance of 4x8 core VMs with 1x32 core VM, there was significantly more overhead in the large VM case.




Second Approach - Expose NUMA to the Guest VM

This can be done using the appropriate KVM directives. With OpenStack Kilo, it will be possible via flavor extra specs and image properties. In the meantime, we configured the hypervisor directly with the following XML for libvirt.

<cpu mode='host-passthrough'>
  <numa>
    <cell id='0' cpus='0-7' memory='16777216'/>
    <cell id='1' cpus='16-23' memory='16777216'/>
    <cell id='2' cpus='8-15' memory='16777216'/>
    <cell id='3' cpus='24-31' memory='16777216'/>
  </numa>
</cpu>

In an ideal world, there would be only two cells defined (0-7,16-23 and 8-15,24-31), but KVM on CentOS 7 currently does not support non-contiguous CPU ranges in a cell [1]. The guests see the configuration as follows:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2593.750
BogoMIPS:              5187.50
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31

With this approach and turning off numad on the hypervisor, the performance of the large VM improved by 9%.
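
For these tests, numad was turned off on the hypervisor before using the static NUMA cells; a minimal sketch of doing so (assuming it was enabled via systemd as above):

systemctl stop numad
systemctl disable numad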

We also investigated the numatune options but these did not produce a significant improvement.

Third Approach - Pinning CPUs

From the hypervisor's perspective, the virtual machine appears as a single process whose vCPU threads need to be scheduled on the available CPUs. While the NUMA configuration above means that memory accesses will tend to be local, the hypervisor scheduler may still move a virtual core to a different physical processor at the next scheduling interval. While this flexibility is useful in the case of hypervisor over-commit, for a CPU bound application it leads to less memory locality.

With Kilo, it will be possible to pin a virtual core to a physical one through OpenStack. For these tests, the same was done directly in the hypervisor XML, as for NUMA.

<cputune>
  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpupin vcpu="2" cpuset="2"/>
  <vcpupin vcpu="3" cpuset="3"/>
  <vcpupin vcpu="4" cpuset="4"/>
  <vcpupin vcpu="5" cpuset="5"/>
  ...

This means that virtual core #1 always runs on physical core #1.
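
One way to verify the resulting pinning on the hypervisor is virsh (the domain name below is only illustrative):

virsh vcpupin instance-00000042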
Repeating the large VM test provided a further 3% performance improvement.

The exact topology has been set in a simple fashion. Further investigation into the exact mappings between thread siblings is needed to get the most out of the tuning.

The impact on smaller VMs (8 and 16 cores) also needs to be studied. Optimising for one use case carries the risk that other scenarios are adversely affected, and custom configurations for particular VM topologies increase the operational effort of running a cloud at scale. While the changes should be positive, or at worst neutral, this needs to be verified.

Summary

Exposing the NUMA nodes and using CPU pinning has reduced the large VM overhead with KVM from 12.9% to 3.5%. When the features are available in OpenStack Kilo, these can be deployed by setting up the appropriate flavors with the additional pinning and NUMA descriptions for the different hardware types so that large VMs can be run at a much lower overhead.
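
As an illustration of what the Kilo configuration could look like (the flavor name and values here are only an example, not the CERN settings), the NUMA and pinning requests can be expressed as flavor extra specs:

nova flavor-key hpc.large set hw:numa_nodes=2
nova flavor-key hpc.large set hw:cpu_policy=dedicated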

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck and Ulrich Schwickerath (CERN).

Previous blogs in this series are

Updates

[1] RHEV does support this with a later QEMU (version 2.1.2, http://cbs.centos.org/repos/virt7-kvm-common-testing/x86_64/os/Packages/) rather than the default in CentOS 7.


Sunday 2 August 2015

EPT and KSM for High Throughput Computing

As part of the analysis of CERN's compute intensive workload in a virtualised infrastructure, we have been examining various settings of KVM to tune the performance.

EPT

EPT (Extended Page Tables) is an Intel technology which provides hardware assistance for memory virtualisation and is one of the options of the kvm_intel driver in Linux. It is enabled by default but can be controlled through the driver options. The driver can then be reloaded, as long as no qemu-kvm processes are running.

# cat /etc/modprobe.d/kvm_intel.conf
options kvm_intel ept=0
# modprobe -r kvm_intel
# modprobe kvm_intel
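
The active setting can be read back from sysfs (not shown in the original post); it reports N once EPT is disabled:

cat /sys/module/kvm_intel/parameters/ept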

In past studies, EPT has had a negative performance impact on High Energy Physics applications. With recent changes in processor architecture, this was re-tested.

The result was a 6% performance improvement with EPT off. This seems surprising, as the feature is intended to improve virtualisation performance rather than reduce it.

The CERN configuration uses hypervisors running CentOS 7 and guests running Scientific Linux CERN 6. With this configuration, EPT can be turned off without problems, but a recent test with CentOS 7 guests showed that this has an issue, which has been reported upstream: only one CPU is recognised and the rest are reported as unresponsive.

KSM

Kernel same-page merging (KSM) is a technology which finds common memory pages inside a Linux system and merges them so that only a single copy is kept, saving memory. If one of the copies is updated, a new copy is created, so the function is transparent to the processes on the system.

For hypervisors, this can be very beneficial where multiple guests are running with the same level of operating system. However, there is an overhead due to the scanning process which may cause the applications to run more slowly. 

We benchmarked 4 VMs of 8 cores each, running the same operating system levels. The result was that KSM caused an overhead of around 1%.



To turn KSM off, the ksmtuned daemon should be stopped and disabled.

systemctl stop ksmtuned
systemctl disable ksmtuned

The ksmd kernel thread still appears in the process list but no longer uses any CPU resources. Following the change, it is important to verify that there is still sufficient memory on the hypervisor, since no longer merging pages can increase memory usage and lead to swapping (which has a very significant performance impact).
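
As a supplementary check (not part of the original procedure), the KSM sysfs interface shows how many pages are currently being shared, and writing 2 to the run file stops KSM and unmerges all previously merged pages:

cat /sys/kernel/mm/ksm/pages_sharing
echo 2 > /sys/kernel/mm/ksm/run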

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck and Ulrich Schwickerath  (CERN).

Previous blogs in this series are

References

  • Intel article on EPT - https://01.org/blogs/tlcounts/2014/virtualization-advances-performance-efficiency-and-data-protection
  • Previous studies with KVM and HEP code - https://indico.cern.ch/event/35523/session/28/contribution/246/attachments/705127/968004/HEP_Specific_Benchmarks_of_Virtual_Machines_on_multi-core_CPU_Architectures.pdf
  • VMWare paper at https://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf





CPU Model Selection for High Throughput Computing

As part of the work to tune the configuration of the CERN cloud, we have been exploring various options for tuning compute intensive workloads.

One option in the Nova configuration allows the CPU model visible in the guest to be chosen from several alternatives.

The choices are as follows (a sample configuration sketch is shown after the list):
  • host passthrough provides an exact view of the underlying processor
  • host model provides a view of a processor model which is close to the underlying processor but gives the same view for several processors, e.g. a range of different frequencies within the same processor family
  • custom allows the administrator to provide a view selecting the exact characteristics of the processor
  • none gives the hypervisor default configuration
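A minimal sketch of where this is selected on the hypervisor, in the libvirt section of nova.conf (the values are illustrative, not the CERN configuration):

[libvirt]
cpu_mode = host-passthrough
# alternatives: host-model, none, or custom (which also requires cpu_model, e.g. cpu_model = SandyBridge)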
There are a number of factors to consider for this selection
  • Migration between hypervisors has to be done with the same processor in the guest. Thus, if host passthrough is configured and the VM is migrated to a new generation of servers with a different processor, this operation will fail.
  • Performance will vary with host passthrough being the fastest as the application can use the full feature set of the processor. The extended instructions available will vary as shown at the end of this article where different settings give different flags.
The exact performance impact will vary according to the application. High Energy Physics uses the HEPSpec06 benchmark suite, which is a subset of the SPEC CPU 2006 benchmarks. Using this benchmark, we observed around a 4% reduction in performance for CPU bound applications with host model. Moving to the default gave an overhead of around 5%.


Given these significant differences, the CERN cloud is configured such that
  • hypervisors running compute intensive workloads are configured for maximum performance (host passthrough). These workloads are generally easy to re-create, so there is no need for migration between hypervisors (for example, for a warranty replacement); instead, new instances can be created on the new hardware and the old instances deleted
  • hypervisors running services are configured with host model, so that they can be migrated between generations of equipment and between hypervisors if required, such as for an intervention
In the future, we would be interested in making this setting an option at VM creation time, such as metadata on the nova boot command or a specific image property, so that end users could choose the appropriate option for their workloads.

host-passthrough

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping        : 4
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

host-model

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel Xeon E312xx (Sandy Bridge)
stepping        : 1
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

none

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : QEMU Virtual CPU version 1.5.3
stepping        : 3
microcode       : 1
cpu MHz         : 2593.748
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good unfair_spinlock pni cx16 hypervisor lahf_lm
bogomips        : 5187.49
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Previous blogs in this series are
  • CPU topology - http://openstack-in-production.blogspot.fr/2015/08/openstack-cpu-topology-for-high.html

Contributions from Ulrich Schwickerath and Arne Wiebalck have been included in this article.

Saturday 1 August 2015

OpenStack CPU topology for High Throughput Computing

We are starting to look at the latest features of OpenStack Juno and Kilo as part of the CERN OpenStack cloud to optimise a number of different compute intensive applications.

We'll break down the tips and techniques into a series of small blogs. A corresponding set of changes to the upstream documentation will also be made to ensure the options are documented fully.

In the modern CPU world, a server consists of multiple levels of processing units.
  • Sockets, where each of the processor chips is inserted
  • Cores, where each processor contains multiple processing units which can run multiple processes in parallel
  • Threads, where (if features such as SMT are enabled) multiple processing threads can be active at the expense of sharing a core
The typical hardware used at CERN is a 2 socket system. This provides optimum price/performance for our typical high throughput applications, which simulate and process events from the Large Hadron Collider. The aim is not to process a single event as quickly as possible but rather to process the maximum number of events within a given time (within the total computing budget available). As the price of processors varies according to their performance, the selected systems are often not the fastest possible but the ones which give the best performance/CHF.

A typical example of this approach is our use of SMT, which leads to a 20% increase in total throughput although each individual thread runs correspondingly slower. Thus, the typical configuration is:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               2999.953
BogoMIPS:              5192.93
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31


By default in OpenStack, the virtual CPUs in a guest are allocated as standalone processors. This means that a 32 vCPU VM will appear as

  • 32 sockets
  • 1 core per socket
  • 1 thread per core
As part of ongoing performance investigations, we wondered about the impact of this topology on CPU bound applications.

With OpenStack Juno, there is a mechanism to pass the desired topology. This can be done through flavors or image properties.

The names are slightly different between the two usages: flavor extra specs use properties which start with hw: while image properties start with hw_.

The flavor configurations are set by the cloud administrators, and the image properties can be set by the project members. The cloud administrator can also set maximum values (e.g. hw_cpu_max_cores) so that the project members cannot define values which are incompatible with the underlying resources.
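
As an illustration (the flavor name is only an example), the topology below could equally be requested through flavor extra specs:

nova flavor-key m1.large set hw:cpu_sockets=2 hw:cpu_cores=8 hw:cpu_threads=2

The image property route, which project members can use themselves, is shown next.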


$ openstack image set --property hw_cpu_cores=8 --property hw_cpu_threads=2 --property hw_cpu_sockets=2 0215d732-7da9-444e-a7b5-798d38c769b5

A VM booted with this image then has the configuration reflected:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2593.748
BogoMIPS:              5187.49
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K

NUMA node0 CPU(s):     0-31

While this gives the possibility to construct interesting topologies, the performance benefits are not clear. The standard High Energy Physics benchmark shows no significant change. Given that there is no direct mapping between the cores in the VM and the underlying physical ones, this may be because the cores are not pinned to the corresponding sockets/cores/threads, so Linux may be optimising for a virtual topology rather than the real one.

This work was in collaboration with Sean Crosby (University of Melbourne) and Arne Wiebalck (CERN).

The following documentation reports have been raised
  • Flavors Extra Specs -  https://bugs.launchpad.net/openstack-manuals/+bug/1479270
  • Image Properties - https://bugs.launchpad.net/openstack-manuals/+bug/1480519
