LHC Tunnel

Saturday 26 September 2015

EPT, Huge Pages and Benchmarking


Having reported that EPT has a negative influence on the High Energy Physics standard benchmark HepSpec06, we have started to deploy the new setting across the CERN OpenStack cloud, following these steps on each hypervisor (sketched below):
  • Setting the ept flag in /etc/modprobe.d/kvm_intel.conf to off
  • Stopping new VMs from being scheduled on the hypervisor and waiting for the work on each guest to finish
  • Changing the flag and reloading the module
  • Enabling new work for the hypervisor
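
On each hypervisor, the procedure looks roughly like this (a sketch; ept is the standard kvm_intel module parameter, but the exact file contents are illustrative):

$ cat /etc/modprobe.d/kvm_intel.conf
options kvm_intel ept=0
$ modprobe -r kvm_intel    # once all guests on the hypervisor have finished
$ modprobe kvm_intel
$ cat /sys/module/kvm_intel/parameters/ept
N
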
According to the HS06 tests, this should lead to a reasonable performance improvement. However, certain users reported significantly worse performance than before. In particular, some workloads showed significant differences in the before and after characteristics described below.

Before the change, the workload was primarily CPU bound, spending most of its time in user space. CERN applications have to process significant amounts of data, so it is not always possible to ensure 100% utilisation, but the aim is that the CPU time given to the workload is spent in user space.


When EPT was turned off, some hypervisors showed a very different performance profile: a major increase in non-user load and a reduction in throughput for the experiment workloads. However, this effect was not observed on the servers with AMD processors.
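
The user/system CPU split on a hypervisor can be watched with standard tools, for example (illustrative):

$ vmstat 5     # compare the 'us' and 'sy' columns
$ sar -u 5     # compare %user and %system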


With tools such as perf, we were able to trace the time down to the handling of TLB misses. perf gives:

78.75% [kernel] [k] _raw_spin_lock
 6.76% [kernel] [k] set_spte
 1.97% [kernel] [k] memcmp
 0.58% [kernel] [k] vmx_vcpu_run
 0.46% [kernel] [k] ksm_do_scan
 0.44% [kernel] [k] vcpu_enter_guest

The process behind the _raw_spin_lock is qemu-kvm.
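
For reference, a profile of this kind can be collected system-wide with perf and then broken down by process and symbol (a sketch; the sampling duration is illustrative):

$ perf record -a -g -- sleep 30
$ perf report --sort comm,dso,symbol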

Using SystemTap kernel backtraces, we see mostly page faults and spte_* functions (shadow page table updates). Neither of these should be necessary when there is hardware support for address translation, i.e. EPT.
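
Such backtraces can be collected with a short SystemTap script along these lines (a sketch; the process name filter and sample limit are illustrative):

$ stap -e 'global t;
    probe timer.profile { if (execname() == "qemu-kvm") t[backtrace()] <<< 1 }
    probe end { foreach (s in t- limit 5) { print_stack(s); printf("%d samples\n", @count(t[s])) } }'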

There may be specific application workloads for which the EPT setting is not optimal; in the worst case, the performance was several times slower. EPT/NPT increases the cost of a page table walk when the page is not cached in the TLB. This document shows how processors can speed up page walks: http://www.cs.rochester.edu/~sandhya/csc256/seminars/vm_yuxin_yanwei.pdf. AMD includes a page walk cache in its processors, which speeds up the walks, as described in this paper: http://vglab.cse.iitd.ac.in/~sbansal/csl862-virt/readings/p26-bhargava.pdf

In other words, EPT slows down HS06 results when small pages are involved because the HS06 benchmarks miss the TLB a lot, and with nested paging each miss triggers a two-dimensional page table walk: with four-level page tables on both the guest and host side, a worst-case miss can take up to 24 memory references instead of 4. NPT doesn't slow HS06 down because AMD has a page walk cache to help speed up finding pages that are not in the TLB. EPT comes good again when we have large pages because they rarely result in a TLB miss. So, HS06 is probably representative of most of the job types, but there is a small share of jobs which are different and triggered the above-mentioned problem.
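
Whether a given workload misses the TLB a lot can be checked directly with hardware counters, for example (the binary name is a placeholder):

$ perf stat -e dTLB-loads,dTLB-load-misses ./workload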

However, as mentioned in the previous blog post, running the benchmark with EPT on costs around 6% compared to previous runs. To mitigate the EPT overheads, following the comments on the previous post, we looked into using dedicated huge pages. Our hypervisors run CentOS 7 and thus support both transparent huge pages and explicit huge pages. Transparent huge pages perform a useful job under normal circumstances but are opportunistic in nature. They are also limited to 2MB and cannot use the maximum 1GB page size.

We tried setting the default huge page size to 1GB using the grub command line configuration.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /boot/grub2/grub.cfg | grep hugepage
linux16 /vmlinuz-3.10.0-229.11.1.el7.x86_64 root=UUID=7d5e2f2e-463a-4842-8e11-d6fac3568cf4 ro rd.md.uuid=3ff29900:0eab9bfa:ea2a674d:f8b33550 rd.md.uuid=5789f86e:02137e41:05147621:b634ff66 console=tty0 nodmraid crashkernel=auto crashkernel=auto rd.md.uuid=f6b88a6b:263fd352:c0c2d7e6:2fe442ac vconsole.font=latarcyrheb-sun16 vconsole.keymap=us LANG=en_US.UTF-8 default_hugepagesz=1G hugepagesz=1G hugepages=55 transparent_hugepage=never
$ cat /sys/module/kvm_intel/parameters/ept
Y
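
For reference, on CentOS 7 such settings are normally added to GRUB_CMDLINE_LINUX in /etc/default/grub, after which the configuration is regenerated and the hypervisor rebooted (a sketch):

$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ reboot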

It may also be advisable to disable tuned for the moment until bug #1189868 is resolved.
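
On CentOS 7 this can be done via systemd (illustrative):

$ systemctl stop tuned
$ systemctl disable tuned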

We also configured the libvirt XML manually to include the necessary huge pages. This will be available as a flavor or image option when we upgrade to Kilo in a few weeks.

  <memoryBacking>
    <hugepages>
      <page size="1" unit="G" nodeset="0-1"/>
    </hugepages>
  </memoryBacking>
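
For a manual test, the change can be applied with virsh (the domain name below is a placeholder); once on Kilo, the equivalent should be achievable with the hw:mem_page_size flavor extra spec (a sketch):

$ virsh edit instance-000003a4      # add the <memoryBacking> element above
$ nova flavor-key m1.hpc set hw:mem_page_size=1048576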

The hypervisor was configured with huge pages enabled. However, we saw a problem with the distribution of huge pages across the NUMA nodes.

$ cat /sys/devices/system/node/node*/meminfo | fgrep Huge
Node 0 AnonHugePages:   311296 kB
Node 0 HugePages_Total:     29
Node 0 HugePages_Free:       0
Node 0 HugePages_Surp:       0
Node 1 AnonHugePages:     4096 kB
Node 1 HugePages_Total:     31
Node 1 HugePages_Free:       2
Node 1 HugePages_Surp:       0

This shows that the pages were not evenly distributed across the NUMA nodes, which would lead to subsequent performance issues. The suspicion is that the Linux boot sequence used some memory early on, making it difficult to find contiguous 1GB blocks for the huge pages. This led us to deploy 2MB pages rather than 1GB for the moment; while this may not be the optimal setting, it allows better optimisation than 4KB pages and still gives some potential for KSM to benefit. These changes had a positive effect, as the monitoring below shows with the reduction in system time.

[Monitoring graph: system time drops after the change to 2MB huge pages]
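
One way to avoid the boot-time fragmentation issue is to request the 2MB pages per NUMA node at runtime through sysfs (a sketch; the page counts are illustrative):

$ echo 14336 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ echo 14336 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
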
At the OpenStack summit in Tokyo, we'll be holding a session on Hypervisor Tuning, so people are welcome to bring their experiences along and share the various options. Details of the session will appear at https://etherpad.openstack.org/p/TYO-ops-meetup.

Contributions from Ulrich Schwickerath and Arne Wiebalck (CERN) and Sean Crosby (University of Melbourne) have been included in this article along with the help of the LHC experiments to validate the configuration.

