Thursday, 29 September 2016

Hyperthreading in the cloud

The cloud at CERN serves a variety of purposes: personal VMs for development and testing, bulk throughput computing to analyse the data from the Large Hadron Collider, and long-running services for the experiments and the organisation.

The configuration of many of the hypervisors is carefully tuned to maximise compute throughput, i.e. getting as much compute work done as possible in a given time rather than optimising the performance of any individual job. Nearly all of the workloads are also embarrassingly parallel, i.e. each unit of compute can run without needing to communicate with other jobs. A few workloads, such as QCD, need classical High Performance Computing, but these run on dedicated clusters with an Infiniband interconnect rather than the 1Gbit/s or 10Gbit/s Ethernet of the typical hypervisor.

CERN has a public procurement procedure which awards tenders to the bid with the lowest price for a given throughput that complies with the specifications. The typical CERN hardware configuration is dual socket and must have at least 2GB of memory per core.

Intel processors provide simultaneous multithreading (SMT), also known as Hyper-Threading, which presents two logical cores for each physical core. From the machine's perspective, this appears as double the number of cores compared to a non-SMT configuration. Switching SMT on or off requires a BIOS parameter change, so the areas of the cloud that run with SMT on or off have to be defined statically in advance, with appropriate capacity planning.

Besides higher per-core performance, an SMT-off configuration has a second benefit: the memory per core doubles. A server with 64GB of memory and 32 cores with SMT on has 2GB per core. Dropping SMT leaves 16 cores and therefore 4GB per core, which can be useful for some workloads.
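The memory-per-core arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the assumption of two hardware threads per physical core are ours, and the figures are those of the example server above.

```python
def memory_per_core_gb(total_memory_gb: float,
                       physical_cores: int,
                       smt_enabled: bool,
                       threads_per_core: int = 2) -> float:
    """Memory per core as seen by the operating system.

    Assumption: with SMT on, each physical core appears as
    `threads_per_core` cores (2 for the servers discussed here).
    """
    visible_cores = physical_cores * (threads_per_core if smt_enabled else 1)
    return total_memory_gb / visible_cores


# Example server: 16 physical cores, 64GB of memory.
print(memory_per_core_gb(64, 16, smt_enabled=True))   # 2.0
print(memory_per_core_gb(64, 16, smt_enabled=False))  # 4.0
```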

Setting the BIOS parameters for a subset of the hypervisors causes several difficulties:
  • With older BIOSes, this is a manual operation. On the most recent hardware, new tools allow the change to be made with a Linux program and a reboot.
  • A motherboard replacement requires the operation to be repeated. This can be overlooked as part of the standard repair activities.
  • Capacity planning requires allocation of appropriate blocks of servers. At CERN, we use OpenStack cells to allow the cloud to scale to our needs, with each cell having a unique hardware configuration such as a particular processor/memory combination, so dedicated cells need to be created for the SMT-off machines. When the capacity of these cells is exceeded, unused resources elsewhere in the cloud cannot be trivially used; further administrative reconfiguration is required.

The reference benchmark for High Energy Physics is HEPSpec06, a subset of the SPEC benchmarks that matches the typical instruction workload. Running it in parallel on each of the cores in a machine measures the throughput provided by a given configuration.

| SMT | VM configuration      | Throughput (HS06) |
|-----|-----------------------|-------------------|
| On  | 2 VMs each 16 cores   | 351               |
| On  | 4 VMs each 8 cores    | 355               |
| Off | 1 VM of 16 cores      | 284.5             |

Thus, the total throughput of the server with SMT off is significantly lower (284.5 compared to 351, about 19% less) but the individual core performance is higher (284.5/16 = 17.8 compared to 351/32 = 11.0). Where some steps of an experiment workflow are serialised, this higher single-core performance was a significant gain, but it came at an operational cost.
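As a quick check of these figures, the per-core throughput can be derived from the measured totals. The sketch below uses only the numbers from the table above; the dictionary layout and function name are illustrative choices.

```python
# Measured totals from the benchmark table: (cores in use, total HS06).
results = {
    "SMT on, 2 VMs each 16 cores": (32, 351.0),
    "SMT on, 4 VMs each 8 cores": (32, 355.0),
    "SMT off, 1 VM of 16 cores": (16, 284.5),
}


def hs06_per_core(cores: int, total_hs06: float) -> float:
    """Average HS06 throughput delivered by each core."""
    return total_hs06 / cores


for label, (cores, total) in results.items():
    print(f"{label}: {total} HS06 total, "
          f"{hs06_per_core(cores, total):.1f} HS06 per core")
```

SMT off gives the lowest total throughput but the highest throughput per core.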

To find a cheaper approach, the recently added NUMA flavors in OpenStack were used. The hypervisors were configured with SMT on, but a flavor was created that uses only half of the cores on the server with 4GB/core. The hypervisors are therefore under-committed on cores but fully committed on memory, which prevents another VM from being allocated to the unused cores. In our configuration, this was done by adding numa_nodes=2 to the flavor (the hw:numa_nodes flavor extra spec in Nova), and the NUMA-aware scheduler does the appropriate allocation.
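The packing argument can be illustrated with a small sketch. It is not scheduler code: the `Host` and `Flavor` classes and the `vms_per_host` function are hypothetical, and it assumes no CPU or memory overcommit and no memory reserved for the hypervisor itself.

```python
from dataclasses import dataclass


@dataclass
class Host:
    logical_cores: int  # cores visible with SMT on
    memory_gb: int


@dataclass
class Flavor:
    vcpus: int
    memory_gb: int


def vms_per_host(host: Host, flavor: Flavor) -> int:
    """VMs of this flavor that fit on the host (assumes no overcommit)."""
    by_cores = host.logical_cores // flavor.vcpus
    by_memory = host.memory_gb // flavor.memory_gb
    return min(by_cores, by_memory)


host = Host(logical_cores=32, memory_gb=64)
standard = Flavor(vcpus=16, memory_gb=32)   # 2GB/core
half_host = Flavor(vcpus=16, memory_gb=64)  # 4GB/core, half the cores

print(vms_per_host(host, standard))   # 2
print(vms_per_host(host, half_host))  # 1
```

With the 4GB/core flavor, memory runs out after one VM, so the remaining 16 logical cores stay unallocated.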

This configuration was benchmarked and compared with the SMT on and SMT off configurations.

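| SMT | VM configuration                    | Throughput (HS06) |
|-----|-------------------------------------|-------------------|
| On  | 2 VMs each 16 cores                 | 351               |
| Off | 1 VM of 16 cores                    | 284.5             |
| On  | 1 VM of 16 cores with numa_nodes=2  | 283               |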

The new flavor shows similar characteristics to the SMT off configuration without requiring the BIOS setting change. It can therefore be deployed without configuring dedicated cells with a particular hardware configuration. The Linux and OpenStack schedulers appear to allocate an appropriate distribution of cores across the processors.
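To put "similar characteristics" into numbers, the relative differences between the three measurements can be computed as follows. The helper name is illustrative; the values are the HS06 totals reported above.

```python
def relative_difference(value: float, reference: float) -> float:
    """Fractional difference of `value` with respect to `reference`."""
    return (value - reference) / reference


smt_on = 351.0       # 2 VMs each 16 cores, SMT on
smt_off = 284.5      # 1 VM of 16 cores, SMT off
numa_flavor = 283.0  # 1 VM of 16 cores, SMT on, numa_nodes=2

print(f"NUMA flavor vs SMT off: {relative_difference(numa_flavor, smt_off):+.1%}")
print(f"SMT off vs SMT on:      {relative_difference(smt_off, smt_on):+.1%}")
```

The NUMA flavor comes within about half a percent of the SMT off result, while both sit roughly 19% below the full SMT on throughput.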