LHC Tunnel

LHC Tunnel

Tuesday, 30 January 2018

Keep calm and reboot: Patching recent exploits in a production cloud

At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources for both the laboratory's physics program but also for the organisation's administrative operations such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

With the accelerator stopping over the CERN annual closure until mid March, this is a good period to be planning reconfiguration of compute resources such as the migration of our central batch system which schedules the jobs across the central compute resources to a new system based on HTCondor. The compute resources are heavily used but there is more flexibility to drain some parts in the quieter periods of the year when there is not 10PB/month coming from the detectors. However, this year we have had an unexpected additional task to deploy the fixes for the Meltdown and Spectre exploits across the centre.

The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based with guests of a variety of operating systems including Windows flavors and CERNVM. The campaign to upgrade involved a number of steps
  • Assess the security risk
  • Evaluate the performance impact
  • Test the upgrade procedure and stability
  • Plan the upgrade campaign
  • Communicate with the users
  • Execute the campaign

Security Risk

The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

Two major risks were identified
  • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment for physicists to log into from around the world, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide 1000s of worker nodes processing the data from CERN experiments by farming event after event to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs and thus were rated as high risk for the Meltdown exploit.
  • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors will have VMs owned by different end users and therefore can be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
The remaining VMs are for dedicated services without access for end user applications or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

There are a variety of different hypervisor configurations which we split down by processor type (in view of the Spectre microcode patches). Each of these needs independent performance and stability checks.


Microcode
Assessment
#HVs
Processor name(s)
06-3f-02
covered
3332
E5-2630 v3 @ 2.40GHz,E5-2640 v3 @ 2.60GHz
06-4f-01
covered
2460
E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
06-3e-04
hopefully
1706
E5-2650 v2 @ 2.60GHz
??
unclear
427
CPU family: 21 Model: 1 Model name: AMD Opteron(TM) Processor 6276 Stepping: 2
06-2d-07
unclear
333
E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
06-2c-02
unlikely
168
E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

These risks were explained by the CERN security team to the end users in their regular blogs.

Evaluating the performance impact

The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. These are synthetic programs based on the C++ components of SPEC CPU2006 which match the instruction mix of the typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by by one of the LHC experiments using their real workloads) with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected. 

Test the upgrade procedure and stability

With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre were dependent on the upstream availability of the patches. These could be broken down into several parts
  • Firmware for the processors - the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic as new processor firmware was being added on a regular basis and it was not always clear when this needed to be applied, the package version would increase but it was not always that this included an update for the particular hardware type. Following through the Intel release notes,  there were combinations such as "HSX C0(06-3f-02:6f) 3a->3b" which explains that the processor description 06-3f-02:6f is upgraded from release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo and the firmware level can be found at /sys/devices/system/cpu/cpu0/microcode/version. A simple script (spectre-cpu-microcode-checker.sh) was made available to the end users so they could check their systems and this was also used by the administrators to validate the central IT services.
  • For the operating system, we used a second script (spectre-meltdown-checker.sh) which was derived from the upstream github code at https://github.com/speed47/spectre-meltdown-checker.  The team maintaining this package were very responsive incorporating our patches so that other sites could benefit from the combined analysis.

Communication with the users

For the cloud, there are several resource consumers.
  • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters for online control systems such as those used to manage the accelerator infrastructure.
  • End users consume cloud resources by asking for virtual machines and using them as personal working environments. Typical cases would be a MacOS user who needs a Windows desktop where they would create a Windows VM and use protocols such as RDP to access it when required.
The communication approach was as follows:
  • A meeting was held to discuss the risks of exploits, the status of the operating systems and the plan for deployment across the production facilities. With a Q&A session, the major concerns raised were around potential impact on performance and tuning options. 
  • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
  • CERN management was informed of the risks and the plan for deployment.
CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.

Execute the campaign

With the accelerator starting up again in March and the risk of the exploits, the approach taken was to complete the upgrades to the infrastructure in January, leaving February to find any residual problems and resolve them. As the handling of the compute/batch part of the infrastructure was relatively straight forward (with only one service on top), we will focus in the following on the more delicate part of hypervisors running services supporting several thousand users in their daily work.

The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.


The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or on the application side. The remaining zones were scheduled on consecutive days (GVA-B and GVA-C), the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services required to limit the window of service downtime, their hosting servers were moved to the special Group 1, the only one for which we could give a precise start time.

For each group several steps were performed:
  • install all relevant packages
  • check the next kernel is the desired one
  • reset the BMC (needed for some specific hardware to prevent boot problems)
  • log the nova and ping state of all guests
  • stop all alarming 
  • stop nova
  • shut down all instances via virsh
  • reboot the hosts
  • ... wait ... then fix hosts which did not come back
  • check running kernel and vulnerability status on the rebooted hosts
  • check and fix potential issues with the guests
Shutting down virtual machines via 'virsh', rather than the OpenStack APIs, was chosen to speed up the overall process -- even if this required to switch off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in the context of the question whether guests should be shut down and rebooted (which we did during this campaign) or paused/resumed. This is an option we'll have a look at to prepare for similar campaigns in the future.

As some of the hypervisors in the cloud had very long uptimes and this was the first time we systematically rebooted the whole infrastructure since the service went to full production about five years ago, we were not quite sure what kind issues to expect -- and in particular at which scale. To our relief, the problems encountered on the hosts hit less than 1% of the servers and included (in descending order of appearance)
  • hosts stuck in shutdown (solved by IPMI reset)
  • libvirtd stuck after reboot (solved by another reboot)
  • hosts without network connectivity (solved by another reboot)
  • hosts stuck in grub during boot (solved by reinstalling grub) 
On the guest side, virtual machines were mostly ok when the underlying hypervisor was ok as well.
A few additional cases included
  • incomplete kernel upgrades, so the root partition could not be found (solved by booting back into an older kernel and reinstall the desired kernel)
  • file system issues (solved by running file system repairs)
So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!

Conclusions

While these kind of security issues do not arrive very often, the key parts of the campaign follow standard steps, namely assessing the risk, planning the update, communicating with the user community, execution and handling incomplete updates.

Using cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines and encourages good practise to load balance resources.

References

Authors

  • Arne Wiebalck
  • Jan Van Eldik
  • Tim Bell

33 comments:

  1. Amazing post with lots of usefull information. Well done!. Keep Posting check out similar website like this..
    hacksaw blades manufacturing pakistan

    ReplyDelete


  2. I am looking for and I love to post a comment Python training in punethat "The content of your post is awesome" Great work!

    ReplyDelete

  3. Hi to everybody, here everyone is sharing such knowledge, so it’s fastidious to see this site, and I used to visit this blog daily. ExcelR Data Science Courses

    ReplyDelete
  4. This is a wonderful article, Given so ExcelR Python classes in pune much info in it, These type of articles keeps the users interest in the website, and keep on sharing more ... good luck.

    ReplyDelete
  5. LifeVoxel AI - Best Medical Imaging Platform also Provide critical patient-centric delivery of care using intuitive, diagnostic and instantly accessible visualizations over the Internet. 

    RIS PACS
    RIS PACS Software

    ReplyDelete
  6. Nice Post and informative data. Thank you so much for sharing this good post, it was so nice to read and useful to improve my knowledge as updated one, keep blogging.
    Open Stack Training in Electronic City

    ReplyDelete
  7. Very interesting information and very useful topic. I have information regarding data science course in Chennai.
    data-science training
    Data-Analytics course
    business analytics -Python course

    ReplyDelete
  8. Really awesome blog!!! I finally found a great post here.I really enjoyed reading this article. Thanks for sharing valuable information.
    Data Science Course in Marathahalli
    Data Science Course Training in Bangalore

    ReplyDelete
  9. Attend The Data Science Training Bangalore From ExcelR. Practical Data Science Training Bangalore Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Science Training Bangalore.
    Data Science Training Bangalore
    Data Science Interview Questions

    ReplyDelete
  10. wonderful article. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article resolved my all queries.
    Data science Interview Questions

    ReplyDelete
  11. I feel very grateful that I read this. It is very helpful and very informative and I really learned a lot from it.

    data science course

    ReplyDelete
  12. wonderful article. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article resolved my all queries. keep it up.
    data analytics course in Bangalore

    ReplyDelete
  13. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspried me to read more. keep it up.
    Correlation vs Covariance

    ReplyDelete
  14. I like viewing web sites which comprehend the price of delivering the excellent useful resource free of charge. I truly adored reading your posting. Thank you!

    Correlation vs Covariance

    ReplyDelete
  15. I am looking for and I love to post a comment that "The content of your post is awesome" Great work!

    Simple Linear Regression

    Correlation vs Covariance

    ReplyDelete
  16. Very interesting blog. Many blogs I see these days do not really provide anything that attracts others, but believe me the way you interact is literally awesome.You can also check my articles as well.

    Data Science In Banglore With Placements
    Data Science Course In Bangalore
    Data Science Training In Bangalore
    Best Data Science Courses In Bangalore
    Data Science Institute In Bangalore

    Thank you..

    ReplyDelete
  17. I am looking for and I love to post a comment that "The content of your post is awesome" Great work!

    data science interview questions

    ReplyDelete
  18. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance
    Simple linear regression
    data science interview questions

    ReplyDelete
  19. This Was An Amazing ! I Haven't Seen This Type of Blog Ever ! Thankyou For Sharing, data science training

    ReplyDelete
  20. This Was An Amazing ! I Haven't Seen This Type of Blog Ever ! Thankyou For Sharing, data science training

    ReplyDelete

  21. I have recently visited your blog profile. I am totally impressed by your blogging skills and knowledge.
    Data Science Course in Hyderabad

    ReplyDelete
  22. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance
    Simple linear regression
    data science interview questions

    ReplyDelete
  23. This is a wonderful article, Given so much info in it, These type of articles keeps the users interest in the website, and keep on sharing more ... good luck.

    Simple Linear Regression

    Correlation vs Covariance

    ReplyDelete
  24. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance
    Simple linear regression
    data science interview questions

    ReplyDelete
  25. Attend The Data Analyst Course From ExcelR. Practical Data Analyst Course Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Analyst Course.
    Data Analyst Course

    ReplyDelete

  26. Very interesting blog Thank you for sharing such a nice and interesting blog and really very helpful article.
    Data Science Course in Hyderabad

    ReplyDelete
  27. Excellent Blog! I would like to thank for the efforts you have made in writing this post. I am hoping the same best work from you in the future as well. I wanted to thank you for this websites! Thanks for sharing. Great websites!

    data science interview questions

    ReplyDelete
  28. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance
    Simple linear regression
    data science interview questions

    ReplyDelete