LHC Tunnel

Tuesday, 30 January 2018

Keep calm and reboot: Patching recent exploits in a production cloud

At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources for both the laboratory's physics program and the organisation's administrative operations, such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

With the accelerator stopped over the CERN annual closure until mid-March, this is a good period for planning the reconfiguration of compute resources, such as the migration of our central batch system, which schedules jobs across the central compute resources, to a new system based on HTCondor. The compute resources are heavily used, but there is more flexibility to drain some parts in the quieter periods of the year when there is not 10PB/month coming from the detectors. This year, however, we had an unexpected additional task: deploying the fixes for the Meltdown and Spectre exploits across the centre.

The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based, with guests running a variety of operating systems including Windows flavors and CERNVM. The upgrade campaign involved a number of steps:
  • Assess the security risk
  • Evaluate the performance impact
  • Test the upgrade procedure and stability
  • Plan the upgrade campaign
  • Communicate with the users
  • Execute the campaign

Security Risk

The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

Two major risks were identified:
  • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment for physicists to log into from around the world, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide 1000s of worker nodes processing the data from CERN experiments by farming event after event to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs and thus were rated as high risk for the Meltdown exploit.
  • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors will have VMs owned by different end users and therefore can be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
The remaining VMs are for dedicated services without access for end user applications or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

There are a variety of different hypervisor configurations, which we broke down by processor type (in view of the Spectre microcode patches). Each of these needs independent performance and stability checks.


Microcode | Assessment | #HVs | Processor name(s)
06-3f-02  | covered    | 3332 | E5-2630 v3 @ 2.40GHz, E5-2640 v3 @ 2.60GHz
06-4f-01  | covered    | 2460 | E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
06-3e-04  | hopefully  | 1706 | E5-2650 v2 @ 2.60GHz
??        | unclear    |  427 | AMD Opteron(TM) Processor 6276 (CPU family 21, model 1, stepping 2)
06-2d-07  | unclear    |  333 | E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
06-2c-02  | unlikely   |  168 | E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

These risks were explained by the CERN security team to the end users in their regular blogs.

Evaluating the performance impact

The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. It consists of synthetic programs based on the C++ components of SPEC CPU2006, which match the instruction mix of typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by one of the LHC experiments using their real workloads), with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected.

Test the upgrade procedure and stability

With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre was dependent on the upstream availability of the patches. These could be broken down into several parts:
  • Firmware for the processors - the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic: new processor firmware was added on a regular basis, and it was not always clear when an update needed to be applied, since the package version would increase without necessarily including an update for a particular hardware type. The Intel release notes contain entries such as "HSX C0(06-3f-02:6f) 3a->3b", which means that the microcode for the processor described by 06-3f-02:6f is upgraded from release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo, and the current firmware level can be found at /sys/devices/system/cpu/cpu0/microcode/version. A simple script (spectre-cpu-microcode-checker.sh) was made available to the end users so they could check their systems, and it was also used by the administrators to validate the central IT services.
  • For the operating system, we used a second script (spectre-meltdown-checker.sh), which was derived from the upstream GitHub code at https://github.com/speed47/spectre-meltdown-checker. The team maintaining this package was very responsive in incorporating our patches so that other sites could benefit from the combined analysis.
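The matching of a release-note entry against a host can be sketched as follows. This is not the spectre-cpu-microcode-checker.sh script itself, only a minimal Python illustration of the idea; the function names (cpu_signature, parse_release_note, update_pending, read_local_state) are hypothetical, and the release-note format is assumed to follow the single example quoted above.

```python
import re
from typing import Optional, Tuple

# Matches entries shaped like "HSX C0(06-3f-02:6f) 3a->3b" (format assumed
# from the one example in the text; other layouts are not handled).
NOTE_RE = re.compile(
    r"\(([0-9a-f]{2}-[0-9a-f]{2}-[0-9a-f]{2}):[0-9a-f]+\)\s+([0-9a-f]+)->([0-9a-f]+)"
)


def cpu_signature(cpuinfo_text: str) -> str:
    """Return 'family-model-stepping' in hex (e.g. '06-3f-02') for the
    first processor listed in /proc/cpuinfo-style text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Exact key match, so that "model name" is not mistaken for "model".
        if sep and key in ("cpu family", "model", "stepping") and key not in fields:
            fields[key] = int(value.strip())
    return "{:02x}-{:02x}-{:02x}".format(
        fields["cpu family"], fields["model"], fields["stepping"]
    )


def parse_release_note(entry: str) -> Tuple[str, int, int]:
    """Split a release-note entry into (signature, old revision, new revision)."""
    match = NOTE_RE.search(entry)
    if match is None:
        raise ValueError("unrecognised release-note entry: %r" % entry)
    signature, old, new = match.groups()
    return signature, int(old, 16), int(new, 16)


def update_pending(signature: str, running_revision: int, entry: str) -> Optional[bool]:
    """True if the entry offers a newer revision for this CPU, False if the
    host is already up to date, None if the entry is for another CPU."""
    entry_signature, _old, new = parse_release_note(entry)
    if entry_signature != signature:
        return None
    return running_revision < new


def read_local_state() -> Tuple[str, int]:
    """Read signature and running microcode revision from the local host
    (x86 Linux only, using the two paths named in the text)."""
    with open("/proc/cpuinfo") as f:
        signature = cpu_signature(f.read())
    with open("/sys/devices/system/cpu/cpu0/microcode/version") as f:
        revision = int(f.read().strip(), 16)
    return signature, revision
```

On a hypervisor, `update_pending(*read_local_state(), "HSX C0(06-3f-02:6f) 3a->3b")` would then indicate whether that particular entry is relevant to the host and still needs to be applied.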

Communication with the users

For the cloud, there are several resource consumers.
  • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters for online control systems such as those used to manage the accelerator infrastructure.
  • End users consume cloud resources by asking for virtual machines and using them as personal working environments. A typical case would be a macOS user who needs a Windows desktop: they would create a Windows VM and use a protocol such as RDP to access it when required.
The communication approach was as follows:
  • A meeting was held to discuss the risks of the exploits, the status of the operating systems and the plan for deployment across the production facilities. In the Q&A session, the major concerns raised were the potential impact on performance and the tuning options.
  • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
  • CERN management was informed of the risks and the plan for deployment.
CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.

Execute the campaign

With the accelerator starting up again in March and the risk of the exploits, the approach taken was to complete the upgrades to the infrastructure in January, leaving February to find and resolve any residual problems. As the handling of the compute/batch part of the infrastructure was relatively straightforward (with only one service on top), we focus below on the more delicate part: the hypervisors running services that support several thousand users in their daily work.

The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.


The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or on the application side. The remaining main zones (GVA-B and GVA-C) were scheduled on consecutive days, and the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services needed to limit their downtime window, their hosting servers were moved to the special Group 1, the only one for which we could give a precise start time.
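As an illustration of this grouping, here is a minimal Python sketch. The Hypervisor record, its time_critical flag and the reboot_groups function are hypothetical names, and ordering the cell groups alphabetically is an assumption made only to get a deterministic result; the actual tooling used in the campaign is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hypervisor:
    name: str
    cell: str                    # OpenStack cell the host belongs to
    time_critical: bool = False  # hosts services needing a fixed downtime window


def reboot_groups(hosts: List[Hypervisor]) -> List[List[str]]:
    """Return host names in reboot order: the special Group 1 first,
    then one group per cell (cells sorted by name, an assumption)."""
    group1 = [h.name for h in hosts if h.time_critical]
    by_cell: Dict[str, List[str]] = defaultdict(list)
    for h in hosts:
        if not h.time_critical:
            by_cell[h.cell].append(h.name)
    groups = [group1] if group1 else []
    groups.extend(by_cell[cell] for cell in sorted(by_cell))
    return groups
```

Each returned group would be restarted and checked before moving on to the next one, with only the first group tied to an announced start time.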

For each group several steps were performed:
  • install all relevant packages
  • check the next kernel is the desired one
  • reset the BMC (needed for some specific hardware to prevent boot problems)
  • log the nova and ping state of all guests
  • stop all alarming 
  • stop nova
  • shut down all instances via virsh
  • reboot the hosts
  • ... wait ... then fix hosts which did not come back
  • check running kernel and vulnerability status on the rebooted hosts
  • check and fix potential issues with the guests
Shutting down virtual machines via 'virsh', rather than the OpenStack APIs, was chosen to speed up the overall process, even though this required switching off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in view of the question of whether guests should be shut down and rebooted (which we did during this campaign) or paused and resumed. This is an option we will look at to prepare for similar campaigns in the future.
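The guest shutdown step could look roughly like the following Python sketch. It is not the script used in the campaign: the function names, the timeout and the polling interval are assumptions, and it presumes that nova-compute has already been stopped on the host. The only external commands used are `virsh list --name --state-running` and `virsh shutdown <domain>`.

```python
import subprocess
import time
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def running_guests(runner: Runner = run) -> List[str]:
    """Names of the guests libvirt currently reports as running."""
    out = runner(["virsh", "list", "--name", "--state-running"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def shut_down_guests(
    runner: Runner = run,
    timeout_s: int = 300,   # assumed value, tune per site
    poll_s: int = 5,        # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Request a clean shutdown of every running guest, then poll until
    none are left or the timeout expires. Returns the guests still running."""
    for name in running_guests(runner):
        runner(["virsh", "shutdown", name])
    waited = 0
    while waited < timeout_s:
        if not running_guests(runner):
            return []
        sleep(poll_s)
        waited += poll_s
    return running_guests(runner)
```

Passing the command runner as a parameter keeps the logic testable without a hypervisor. Any guests returned by `shut_down_guests()` ignored the shutdown request and would need manual attention before the host is rebooted.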

As some of the hypervisors in the cloud had very long uptimes and this was the first time we systematically rebooted the whole infrastructure since the service went to full production about five years ago, we were not quite sure what kind of issues to expect, and in particular at which scale. To our relief, the problems encountered on the hosts hit fewer than 1% of the servers and included (in descending order of appearance):
  • hosts stuck in shutdown (solved by IPMI reset)
  • libvirtd stuck after reboot (solved by another reboot)
  • hosts without network connectivity (solved by another reboot)
  • hosts stuck in grub during boot (solved by reinstalling grub) 
On the guest side, virtual machines were mostly OK when the underlying hypervisor was OK as well.
A few additional cases included:
  • incomplete kernel upgrades, so the root partition could not be found (solved by booting back into an older kernel and reinstalling the desired kernel)
  • file system issues (solved by running file system repairs)
So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!

Conclusions

While this kind of security issue does not arise very often, the key parts of the campaign follow standard steps, namely assessing the risk, planning the update, communicating with the user community, executing the campaign and handling incomplete updates.

Using cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines, and it encourages the good practice of load balancing resources.

Authors

  • Arne Wiebalck
  • Jan Van Eldik
  • Tim Bell
